Improving Nighttime Driving-Scene Segmentation via Dual Image-adaptive Learnable Filters

Semantic segmentation of driving-scene images is vital for autonomous driving. Although encouraging performance has been achieved on daytime images, the performance on nighttime images is less satisfactory due to insufficient exposure and the lack of labeled data. To address these issues, we present an add-on module called dual image-adaptive learnable filters (DIAL-Filters) to improve semantic segmentation in nighttime driving conditions, aiming to exploit the intrinsic features of driving-scene images under different illuminations. DIAL-Filters consist of two parts: an image-adaptive processing module (IAPM) and a learnable guided filter (LGF). With DIAL-Filters, we design both unsupervised and supervised frameworks for nighttime driving-scene segmentation, which can be trained in an end-to-end manner. Specifically, the IAPM module consists of a small convolutional neural network with a set of differentiable image filters, so that each image can be adaptively enhanced for better segmentation with respect to its illumination. The LGF is employed to enhance the output of the segmentation network to obtain the final segmentation result. DIAL-Filters are lightweight and efficient, and they can be readily applied to both daytime and nighttime images. Our experiments show that DIAL-Filters significantly improve supervised segmentation performance on the ACDC_Night and NightCity datasets, and achieve state-of-the-art performance for unsupervised nighttime semantic segmentation on the Dark Zurich and Nighttime Driving testbeds.


I Introduction

Semantic segmentation aims to divide an image into several regions, each corresponding to an object category. As a fundamental task in computer vision, semantic segmentation is widely used in autonomous driving [46], indoor navigation [35] and virtual reality [16]. By taking advantage of the powerful feature representation of convolutional neural networks, deep learning-based semantic segmentation methods [11, 18, 52, 41, 15, 33] have achieved encouraging results on conventional daytime datasets [9, 4]. However, these methods generalize poorly to adverse nighttime lighting, which is critical for real-world applications such as autonomous driving. In this work, we focus on semantic segmentation in nighttime driving scenarios.

Fig. 1: The visual results of different modules in our method. The IAPM module outputs clearer images with better brightness. The LGF module improves the segmentation performance on object boundaries.

There are two main challenges for nighttime driving-scene segmentation. One is the difficulty of obtaining large-scale labeled nighttime datasets due to the poor visual perception. To this end, several nighttime datasets have been developed recently [34, 30]. NightCity [34] contains 2,998 labeled nighttime driving-scene images and ACDC_Night [30] has 400, which can be used for supervised training. The other challenge is the exposure imbalance and motion blur in nighttime images, which are hard for existing daytime segmentation methods to handle. To tackle these challenges, some domain adaptation methods have been proposed to transfer semantic segmentation models from daytime to nighttime without using labels in the nighttime domain. The domain adaptation network (DANNet) [44] employs adversarial learning for nighttime semantic segmentation and adds an image relighting subnetwork before the segmentation network. This introduces a large number of training parameters, which is not conducive to deployment. In [5, 28], the twilight domain is treated as a bridge to achieve domain adaptation from daytime to nighttime. Moreover, some methods [28, 32, 25, 31] take an image transfer model as a pre-processing stage to stylize nighttime or daytime images so as to construct synthetic datasets. By involving complicated image transfer networks between day and night, these methods are usually computationally intensive. In particular, it is difficult for the image transfer networks to achieve the ideal transformation when the inter-domain gap is large.

The nighttime images captured in driving scenes often contain both over-exposed and under-exposed parts, which seriously degrade the visual appearance and structure. Figure 1(a) shows an example nighttime image with both over-exposed (street lights and car headlights) and under-exposed (background and trees) regions. Such uneven brightness deteriorates the image content and texture, making it difficult to accurately segment object boundaries. In digital imaging systems, retouching experts improve image quality by tuning the hyperparameters of an image enhancement module, including white balance adjustment, gamma correction, exposure compensation, detail enhancement, tone mapping, etc. To avoid manually tuning these parameters, "white-box" image-adaptive enhancement frameworks [13, 47, 49] have been employed to improve image quality.

To address the above issues, we propose a driving-scene semantic segmentation method that improves performance via dual image-adaptive learnable filters (DIAL-Filters), comprising an image-adaptive processing module (IAPM) and a learnable guided filter (LGF) module. Specifically, we present a set of fully differentiable image filters (DIF) in the IAPM module, whose hyperparameters are adaptively predicted by a small CNN-based parameter predictor (CNN-PP) according to the brightness, contrast and exposure of the input image. Moreover, the LGF is introduced to enhance the output of the segmentation network. A joint optimization scheme is used to learn the DIF, CNN-PP, segmentation network and LGF in an end-to-end manner. Additionally, we make use of both daytime and nighttime images to train the proposed network. By taking advantage of the CNN-PP network, our method is able to adaptively deal with images under different lighting. Figure 1 shows an example segmentation process of our proposed approach.

Part of the above image-adaptive filtering techniques was used for the detection task in our previous conference paper [20]. Compared to [20], we make the following new contributions in this work: 1) we extend the image-adaptive filtering methods to the nighttime segmentation task and achieve state-of-the-art results; 2) a learnable guided filter is proposed to improve the segmentation performance on object edge regions; 3) we develop both supervised and unsupervised segmentation frameworks.

The main contributions of this paper are threefold:

  • We propose a novel lightweight add-on module, called DIAL-Filters, which can be easily added to the existing models. It is able to significantly improve the segmentation performance on nighttime images by double enhancement before and after the segmentation network.

  • We train our image-adaptive segmentation model in an end-to-end manner, which ensures that CNN-PP learns an appropriate DIF to enhance each image for segmentation and that an LGF is learned to preserve edges and details.

  • The supervised experiments show that the proposed method can significantly improve segmentation performance on ACDC_Night and NightCity datasets. The unsupervised experiments on Dark Zurich and Nighttime Driving testbeds show that our method achieves state-of-the-art performance for unsupervised nighttime semantic segmentation.

II Related Work

II-A Semantic Segmentation

Image semantic segmentation is essential to many visual understanding systems, and its performance on benchmark datasets has been greatly improved by the development of convolutional neural networks (CNNs). FCN [22] is considered a milestone, demonstrating that a deep network can be trained for semantic segmentation in an end-to-end manner on variable-size images. Multi-level-based methods [18, 52] employ multi-scale analysis to extract the global context while preserving low-level details, with a convolution layer generating the final per-pixel predictions. DeepLab and its variants [2, 3] introduced atrous convolution and atrous spatial pyramid pooling to the segmentation network. In [14], a scale-adaptive network was proposed to deal with objects of different scales. Some approaches [8, 43] either made use of knowledge distillation or designed small networks to trade off accuracy against inference speed.

All the above methods focus on segmentation in daytime conditions. In this paper, we pay attention to nighttime scenes. To investigate the effectiveness of our proposed DIAL-Filters on nighttime driving-scene segmentation, we select three popular and widely used segmentation networks as baselines, including RefineNet [18], PSPNet [52] and DeepLabV2 [2].

II-B Image Adaptation

Image adaptation is widely used in both low-level and high-level tasks. For image enhancement, some traditional methods [26, 48, 40] adaptively compute the parameters of an image transformation according to the corresponding image features. Wang et al. [40] proposed a brightness adjustment function that adaptively tunes the enhancement parameters based on the illumination distribution of an input image. The methods in [13, 47, 49] employ a small CNN to flexibly learn the hyperparameters of the image transformation. Yu et al. [47] utilized a small CNN to learn image-adaptive exposures with deep reinforcement learning and adversarial learning. Hu et al. [13] proposed a post-processing framework with a set of differentiable filters, where deep reinforcement learning (DRL) is used to generate the image operations and filter parameters according to the quality of the retouched image. For the high-level detection task, Zhang et al. [51] presented an improved Canny edge detection method that uses the mean gradient of the entire image to adaptively select dual thresholds. IA-YOLO [20] proposed a light CNN to adaptively predict the filters' parameters for better detection performance. Inspired by these methods, in this work we adopt image adaptation for segmentation in nighttime driving scenarios.

II-C Nighttime Driving-scene Semantic Segmentation

While most existing works focus on "normal", well-illuminated scenes, some works address challenging scenarios such as nighttime scenes. Domain adaptive methods [36, 50, 19, 1] have achieved encouraging performance in many tasks, such as classification, object detection, pedestrian identification and segmentation. Thus, some researchers employed domain adaptation-based methods [5, 39, 28, 31] to transfer models trained on normal scenes to the target domain. In [5], a progressive adaptation approach was proposed to transfer from daytime to nighttime via the bridge of twilight time. Sakaridis et al. [28, 31] presented a guided curriculum adaptation method based on DMAda [5], which gradually adapts segmentation models from day to night using both annotated synthetic images and unlabeled real images. However, the additional segmentation models for different domains in these gradual adaptation methods increase the computational cost significantly. Some studies [27, 32, 25] trained additional style transfer networks, e.g., CycleGAN [53], to perform day-to-night or night-to-day image transfer before training the semantic segmentation models. The disadvantage of these methods is that the performance of the subsequent segmentation network is highly dependent on the preceding style transfer model.

Recently, Deng et al. [7] proposed NightLab, which includes a hardness detection module to find hard regions. Moreover, they combined the Swin Transformer [21] and DeformConv [6] for nighttime segmentation. However, NightLab is very time-consuming since it trains multiple deep models, e.g., a detection network, a regularized light adaptation module, a Swin Transformer and a DeformConv network. Wu et al. [44, 45] proposed an unsupervised one-stage adaptation method, where an image relighting network is placed at the head of the segmentation network. Adversarial learning was employed to achieve domain alignment between the labeled daytime data and unlabeled nighttime data. Unfortunately, the additional RelightNet incurs a large number of parameters and much computation.

In contrast to the above methods, we propose an image-adaptive approach to nighttime segmentation by embedding the proposed DIAL-Filters into a segmentation network. Our method can also be trained for unsupervised domain adaptation with an adversarial loss, and it demonstrates significant advantages in both performance and efficiency.

Fig. 2: The end-to-end training pipeline of our proposed supervised segmentation framework. Our method learns a segmentation network with a small CNN-based parameter predictor (CNN-PP), which takes the downsampled input image to predict the hyperparameters of the filters in the DIF module. The input high-resolution images are then processed by the DIF to aid the segmentation network for better segmentation performance.

III Dual Image-adaptive Learnable Filters

Driving-scene images captured at nighttime have poor visibility due to weak illumination, which leads to difficulties in semantic segmentation. Since each image may contain both over-exposed and under-exposed regions, the key to alleviating this difficulty is to handle the exposure differences. Therefore, we propose a set of dual image-adaptive learnable filters (DIAL-Filters) to enhance the results before and after the segmentation network. As illustrated in Figure 2, the whole pipeline consists of an image-adaptive processing module (IAPM), a segmentation network and a learnable guided filter (LGF). The IAPM module includes a CNN-based parameter predictor (CNN-PP) and a set of differentiable image filters (DIF).

III-A Image-adaptive Processing Module

III-A1 Differentiable Image Filters

As in [13], the design of the image filters should conform to the principles of differentiability and resolution independence. For the gradient-based optimization of CNN-PP, the filters should be differentiable so that the network can be trained by backpropagation. Since a CNN may consume intensive computational resources to process high-resolution images, we learn the filter parameters from a downsampled low-resolution version of the input image. The same filters are then applied to the image at its original resolution, so that the filters are independent of the image resolution.

Our proposed DIF consists of several differentiable filters with adjustable hyperparameters, including Exposure, Gamma, Contrast and Sharpen. As in [13], the standard color operators, such as Gamma, Exposure and Contrast, can be expressed as pixel-wise filters.

Pixel-wise Filters. The pixel-wise filters map an input pixel value P_i = (r_i, g_i, b_i) into an output pixel value P_o = (r_o, g_o, b_o), where (r, g, b) represent the values of the three color channels red, green and blue, respectively. The mapping functions of the three pixel-wise filters are listed in Table I, where the second column lists the parameters to be optimized in our approach. Exposure and Gamma are simple multiplication and power transformations. Obviously, these mapping functions are differentiable with respect to both the input image and their parameters.

Filter | Parameters | Mapping Function
Exposure | E: exposure value | P_o = P_i · 2^E
Gamma | G: gamma value | P_o = (P_i)^G
Contrast | α: contrast value | P_o = α · En(P_i) + (1 − α) · P_i
TABLE I: The mapping functions of the pixel-wise filters

The differentiable contrast filter is designed with an input parameter α that sets the linear interpolation between the original image and a fully enhanced image. As shown in Table I, the enhancement term En(P_i) in the mapping function is defined as follows:

Lum(P_i) = 0.27·r_i + 0.67·g_i + 0.06·b_i, (1)
EnLum(P_i) = (1/2)·(1 − cos(π·Lum(P_i))), (2)
En(P_i) = P_i · EnLum(P_i) / Lum(P_i). (3)

Sharpen Filter. Image sharpening can highlight image details. Following the unsharp masking technique [26], the sharpening process can be described as follows:

F(x, λ) = I(x) + λ·(I(x) − Gau(I(x))), (4)

where I(x) is the input image, Gau(I(x)) denotes a Gaussian filtered version of I(x), and λ is a positive scaling factor. This sharpening operation is differentiable with respect to both x and λ. Note that the sharpening degree can be tuned for better segmentation performance by optimizing λ.
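For concreteness, the sketch below implements the four filters as differentiable tensor operations in PyTorch, assuming images are float tensors in [0, 1] of shape (B, 3, H, W). The clamping, the Gaussian kernel size and the function names (e.g., apply_dif) are our own illustrative choices rather than the authors' code.

```python
import math

import torch
import torch.nn.functional as F


def exposure(img, ev):
    # Exposure: multiplicative gain, P_o = P_i * 2^E.
    return img * (2.0 ** ev.view(-1, 1, 1, 1))


def gamma(img, g):
    # Gamma: power transform, P_o = P_i^G (clamped to keep gradients finite).
    return torch.clamp(img, 1e-4, 1.0) ** g.view(-1, 1, 1, 1)


def contrast(img, alpha):
    # Contrast: interpolate between the input and the fully enhanced image En(P_i).
    lum = (0.27 * img[:, 0:1] + 0.67 * img[:, 1:2] + 0.06 * img[:, 2:3]).clamp(1e-4, 1.0)
    enhanced = img * (0.5 * (1.0 - torch.cos(math.pi * lum)) / lum)
    a = alpha.view(-1, 1, 1, 1)
    return a * enhanced + (1.0 - a) * img


def sharpen(img, lam, kernel_size=5, sigma=1.0):
    # Unsharp masking: I + lambda * (I - Gau(I)), with a fixed Gaussian kernel.
    x = torch.arange(kernel_size, dtype=img.dtype, device=img.device) - kernel_size // 2
    g1d = torch.exp(-x ** 2 / (2 * sigma ** 2))
    g1d = g1d / g1d.sum()
    kernel = (g1d[:, None] * g1d[None, :]).expand(3, 1, kernel_size, kernel_size).contiguous()
    blurred = F.conv2d(img, kernel, padding=kernel_size // 2, groups=3)
    return img + lam.view(-1, 1, 1, 1) * (img - blurred)


def apply_dif(img, params):
    # params: (B, 4) tensor from CNN-PP -> [exposure, gamma, contrast, sharpen],
    # assumed to be squashed into sensible ranges beforehand.
    img = exposure(img, params[:, 0])
    img = gamma(img, params[:, 1])
    img = contrast(img, params[:, 2])
    img = sharpen(img, params[:, 3])
    return img.clamp(0.0, 1.0)
```

Because every operation is built from differentiable primitives, gradients can flow from the segmentation loss back to the predicted hyperparameters, which is what the joint optimization described above requires.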

III-A2 CNN-based Parameters Predictor (CNN-PP)

In a camera image signal processing (ISP) pipeline, adjustable filters are usually employed for image enhancement, and their hyperparameters are manually tuned by experienced engineers through visual inspection [24]. Such a tuning process is awkward and expensive when suitable parameters must be found for a broad range of scenes. To address this limitation, we employ a small CNN as a parameter predictor to estimate the hyperparameters, which is very efficient.

The purpose of CNN-PP is to predict the DIF's parameters by understanding the global content of the image, such as brightness, color and tone, as well as the degree of exposure. A downsampled image is sufficient to estimate such information, which greatly saves computational cost. As in [49], we apply the small CNN-PP to a low-resolution version of the input image to predict the hyperparameters of the DIF. Given an input image of any resolution, we simply use bilinear interpolation to downsample it to a fixed low resolution. As shown in Figure 2, the CNN-PP network is composed of five convolutional blocks, one dropout layer with a rate of 0.5 and a fully connected layer. Each convolutional block includes a convolutional layer with stride 2 and a Leaky ReLU. The final fully connected layer outputs the hyperparameters for the DIF module. The output channels of the five convolutional layers are 16, 32, 64, 128 and 128, respectively. The CNN-PP model contains only 278K parameters when the total number of DIF hyperparameters is 4.
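A minimal sketch of such a predictor is given below, with the output channels, stride and dropout taken from the description above; the 3×3 kernels and the 256×256 input resolution are assumptions (they happen to reproduce roughly the reported 278K parameters for four hyperparameters).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNNPP(nn.Module):
    """Parameter predictor sketch: five strided conv blocks -> dropout -> FC."""

    def __init__(self, num_params=4, in_res=256):
        super().__init__()
        chans = [3, 16, 32, 64, 128, 128]
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, stride=2, padding=1),
                nn.LeakyReLU(0.1, inplace=True),
            )
            for i in range(5)
        ])
        self.dropout = nn.Dropout(p=0.5)
        feat = in_res // 2 ** 5  # spatial size after five stride-2 convolutions
        self.fc = nn.Linear(chans[-1] * feat * feat, num_params)

    def forward(self, low_res_img):
        x = self.blocks(low_res_img)
        x = self.dropout(x.flatten(1))
        return self.fc(x)  # raw DIF hyperparameters


# Usage sketch: downsample the full-resolution frame, then predict DIF parameters.
# img_lr = F.interpolate(img, size=(256, 256), mode="bilinear", align_corners=False)
# params = CNNPP()(img_lr)
```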

III-B Learnable Guided Filter

Many recent approaches to high-level visual tasks cascade a guided filter behind their original architecture to improve the results [12, 42]. The guided filter [10] is an edge-preserving and gradient-preserving image operation, which makes use of the object boundaries in the guidance image to detect object saliency. It is able to suppress the saliency outside the objects, improving downstream detection or segmentation performance.

Fig. 3: The pipeline of the learnable guided filter (LGF). The LGF module takes the IAPM output and the segmentation network result as inputs, and outputs the enhanced segmentation result. With the learned guidance, each image can be adaptively processed for better segmentation performance with edge preservation.

The original guided filter takes a guidance image I and an input image p, and produces an output image q. As in Eq. (5), it assumes that q is a linear transformation of I in a window ω_k centered at pixel k:

q_i = a_k·I_i + b_k,  ∀ i ∈ ω_k, (5)

where (a_k, b_k) are linear coefficients assumed to be constant in ω_k. The solution of (a_k, b_k) can be obtained as follows [10]:

a_k = ( (1/|ω|)·Σ_{i∈ω_k} I_i·p_i − μ_k·p̄_k ) / (σ_k² + ε), (6)
b_k = p̄_k − a_k·μ_k, (7)

where μ_k and σ_k² are the mean and variance of I in the window ω_k, |ω| is the number of pixels in ω_k, ε is a regularization parameter, and p̄_k is the mean of p in ω_k. Since each pixel i is covered by multiple windows, applying the linear transformation to every window ω_k and averaging all possible values of q_i, as shown in Eq. (8), yields the filtering output:

q_i = ā_i·I_i + b̄_i, (8)

where ā_i and b̄_i are the average coefficients of all windows overlapping pixel i. To further enhance the segmentation results, we introduce a learnable guided filter behind the segmentation network. Figure 3 illustrates its architecture. The input p is the output of the segmentation network, and the guidance map is generated from the IAPM output by a small guidance network, which involves two convolutional layers with 64 and 19 output channels and contains only 1,491 parameters. The LGF module is trained along with the other modules in an end-to-end manner, which enables it to adaptively process each image for better segmentation performance with edge preservation.
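The following sketch combines a tiny guidance network with a differentiable guided filter built from box (mean) filters, in the spirit of [10, 42]; the 1×1 kernels, window radius and regularization value are assumptions, and the module name is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def box_filter(x, r):
    # Mean filter over a (2r+1)x(2r+1) window, applied per channel.
    k = 2 * r + 1
    return F.avg_pool2d(x, kernel_size=k, stride=1, padding=r, count_include_pad=False)


class LearnableGuidedFilter(nn.Module):
    """Sketch of the LGF: a tiny guidance network plus a differentiable guided filter."""

    def __init__(self, num_classes=19, r=4, eps=1e-2):
        super().__init__()
        self.r, self.eps = r, eps
        self.guide = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(64, num_classes, kernel_size=1),
        )

    def forward(self, enhanced_img, seg_logits):
        I = self.guide(enhanced_img)          # guidance map from the IAPM output
        p = seg_logits                        # preliminary segmentation prediction
        mean_I = box_filter(I, self.r)
        mean_p = box_filter(p, self.r)
        cov_Ip = box_filter(I * p, self.r) - mean_I * mean_p
        var_I = box_filter(I * I, self.r) - mean_I * mean_I
        a = cov_Ip / (var_I + self.eps)       # Eq. (6)
        b = mean_p - a * mean_I               # Eq. (7)
        mean_a = box_filter(a, self.r)        # average over overlapping windows
        mean_b = box_filter(b, self.r)
        return mean_a * I + mean_b            # Eq. (8)
```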

IV Nighttime Semantic Segmentation

The proposed DIAL-Filters are added to the segmentation network to form our nighttime segmentation method. As shown in Figure 2, we plug the IAPM and LGF into the head and tail of the segmentation network, respectively. Most existing methods adopt unsupervised domain adaptation to deal with nighttime segmentation. To enable a more comprehensive comparison, we propose both supervised and unsupervised segmentation frameworks based on DIAL-Filters in this paper.

IV-A Supervised Segmentation with DIAL-Filters

IV-A1 Framework

As illustrated in Figure 2, our supervised nighttime segmentation method consists of an IAPM module, a segmentation network and an LGF module. The IAPM module includes a CNN-based parameters predictor (CNN-PP) and a set of differentiable image filters (DIF). We first downsample the input image to a low resolution and feed it into CNN-PP to predict the DIF's parameters. Then, the full-resolution image filtered by the DIF is taken as the input of the segmentation network. The preliminary segmentation prediction is filtered by the LGF to obtain the final segmentation result. The whole pipeline is trained end-to-end with the segmentation loss, so that CNN-PP learns an appropriate DIF to adaptively enhance each image for better semantic segmentation.
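A compact sketch of this forward pass is shown below; cnn_pp, dif, seg_net and lgf are hypothetical handles for CNN-PP, the DIF, a segmentation backbone and the LGF, and the 256×256 CNN-PP input size is an assumption.

```python
import torch.nn.functional as F


def supervised_forward(img, cnn_pp, dif, seg_net, lgf, low_res=256):
    """One forward pass of the supervised pipeline (module names are illustrative)."""
    img_lr = F.interpolate(img, size=(low_res, low_res),
                           mode="bilinear", align_corners=False)
    params = cnn_pp(img_lr)                 # predict filter hyperparameters
    enhanced = dif(img, params)             # enhance the full-resolution image
    logits = seg_net(enhanced)              # preliminary segmentation
    logits = F.interpolate(logits, size=img.shape[-2:],
                           mode="bilinear", align_corners=False)
    return lgf(enhanced, logits)            # refined prediction via the LGF
```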

IV-A2 Segmentation Network

Following [44], we select three popular semantic segmentation networks for our method, including DeepLabV2 [2], RefineNet [18] and PSPNet [52]. All of them use the common ResNet-101 backbone [11].

IV-A3 Re-weighting and Segmentation Loss

Since the numbers of pixels for different object categories in driving-scene images are uneven, it is difficult for the network to learn the features of small-size object categories, which leads to poor predictions on the pixels of small objects. We therefore use a re-weighting scheme to increase the network's attention to small-size objects:

w_c = −log(r_c), (9)

where r_c represents the proportion of pixels annotated as category c in the labeled Cityscapes dataset. Obviously, the lower the value of r_c, the higher the assigned weight, which facilitates the segmentation of smaller object categories. The weight is then normalized as follows:

ŵ_c = ((w_c − w̄) / σ_w) · std + avg, (10)

where w̄ and σ_w are the mean and standard deviation of the weights w_c, respectively. We set std and avg to fixed default values during training.

We utilize the popular weighted cross-entropy loss for segmentation:

ℓ_seg = −(1/N) Σ_i Σ_{c=1}^{C} ŵ_c · y_i^(c) · log p_i^(c), (11)

where p^(c) is the c-th channel of the segmentation result, ŵ_c is the weight defined in Eq. (10), N is the number of valid pixels in the corresponding labeled segmentation map, C is the number of labeled categories in the Cityscapes dataset, and y^(c) denotes the one-hot encoding of the ground truth for the c-th category.
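As a sketch, the re-weighted loss of Eqs. (9)-(11) can be written compactly on top of PyTorch's cross-entropy; class_freq holds the per-category pixel proportions r_c, and the std/avg defaults below are placeholders rather than the paper's values.

```python
import torch
import torch.nn.functional as F


def reweighted_seg_loss(logits, labels, class_freq, std=0.05, avg=1.0,
                        ignore_index=255):
    """Weighted cross-entropy of Eq. (11) with the re-weighting of Eqs. (9)-(10)."""
    w = -torch.log(class_freq)                          # Eq. (9)
    w = (w - w.mean()) / w.std() * std + avg            # Eq. (10), std/avg are placeholders
    return F.cross_entropy(logits, labels, weight=w.to(logits.device),
                           ignore_index=ignore_index)   # Eq. (11) over valid pixels
```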

Fig. 4: The end-to-end training pipeline of the proposed unsupervised segmentation framework. Three images from daytime Cityscapes, Dark Zurich-daytime and Dark Zurich-nighttime are input to the weight-sharing IAPM. Then, the enhanced outputs are fed into the weight-sharing segmentation network to obtain the preliminary segmentation results. Finally, the segmentation predictions are filtered by a weight-sharing LGF to get the final results. The corresponding IAPM output provides the guidance map of the LGF.

IV-B Unsupervised Segmentation with DIAL-Filters

IV-B1 Framework

Dark Zurich [29] is a relatively comprehensive nighttime dataset of real-world driving scenarios, which contains corresponding images of the same driving scenes at daytime, twilight and nighttime. Our unsupervised method involves three image domains: a source domain S and two target domains T_d and T_n, where S, T_d and T_n denote Cityscapes (daytime), Dark Zurich-D (daytime) and Dark Zurich-N (nighttime), respectively. As shown in Figure 4, our unsupervised nighttime segmentation framework employs a similar architecture to [44]. The proposed unsupervised framework consists of three training circuits, which perform domain adaptation from the labeled source domain S to the two target domains T_d and T_n through the weight-sharing IAPM module, segmentation network and LGF module. It is worth mentioning that only the images in Cityscapes have semantic labels during training.

IV-B2 Discriminators

Following [37], we design discriminators to distinguish whether the segmentation results come from the source domain or the target domains via adversarial learning. Specifically, there are two discriminators, D_d and D_n, with the same structure in our model. Each of them involves five convolutional blocks, and each block includes a 4 × 4 convolution layer followed by a Leaky ReLU. The stride of the first two convolution layers is 2, while that of the rest is 1. The two discriminators are trained to distinguish whether the output is from S or T_d and from S or T_n, respectively.
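A sketch of such a discriminator is given below; the strides and 4×4 kernels follow the description above, while the channel widths are assumptions.

```python
import torch.nn as nn


class DomainDiscriminator(nn.Module):
    """Sketch of a segmentation-output discriminator for adversarial alignment.

    Five 4x4 conv blocks with Leaky ReLU and strides (2, 2, 1, 1, 1) as described
    in the paper; the channel widths (64, 128, 256, 512, 1) are assumptions.
    """

    def __init__(self, num_classes=19, widths=(64, 128, 256, 512, 1)):
        super().__init__()
        layers, in_ch = [], num_classes
        for out_ch, stride in zip(widths, (2, 2, 1, 1, 1)):
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
            ]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, seg_prediction):
        # Per-patch domain scores over the (softmaxed) segmentation output.
        return self.net(seg_prediction)
```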

IV-B3 Objective Functions

When training the proposed end-to-end unsupervised framework, we use a total loss ℓ_total for the generator and the corresponding adversarial losses for the discriminators. The total loss consists of a segmentation loss ℓ_seg, a static loss ℓ_static and an adversarial loss ℓ_adv.

Segmentation Loss: As in Eq. (11), we take the weighted cross-entropy loss as the segmentation loss. In particular, in our unsupervised framework, only the annotated source-domain images are used to optimize this loss. We also use the same std and avg settings during the unsupervised training process.

Static Loss: Considering the similarities between the daytime images in Dark Zurich-D and their corresponding nighttime images in Dark Zurich-N, we employ a static loss for the target domain nighttime images as in [44]. This supports pseudo pixel-level supervision for the static object categories, e.g., road, sidewalk, wall, vegetation, terrain and sky.

We first define P_{T_d} as the target-domain daytime segmentation result and P_{T_n} as the corresponding nighttime segmentation prediction. When calculating the static loss, we only pay attention to the channels corresponding to the static categories. Thus, we obtain the restricted predictions P_{T_d}^s and P_{T_n}^s over the C_s static object categories, where C_s is the number of static categories. We then obtain the re-weighted daytime segmentation result as the pseudo label via Eq. (10). Finally, the static loss is defined as below:

(12)

where N is the number of valid pixels in the corresponding pseudo-labeled segmentation map, and o denotes the likelihood map of the correct category, which is defined as follows:

(13)

Here, the one-hot encoding of the semantic pseudo ground truth is compared with the nighttime prediction within a small window centered at each pixel, which tolerates slight misalignment between the corresponding day and night images.

Adversarial Loss: Generative adversarial training is widely used to align two domains. In our case, we use two discriminators to distinguish whether a segmentation prediction comes from the source domain or from the target domains, and we employ the least-squares loss function [23] in our adversarial training. The adversarial loss is defined by:

ℓ_adv = (D_d(P_{T_d}) − l_s)² + (D_n(P_{T_n}) − l_s)², (14)

where l_s is the label for the source domain. Finally, we define the total loss of the generator (G) as follows:

ℓ_total = λ_seg·ℓ_seg + λ_static·ℓ_static + λ_adv·ℓ_adv, (15)

where λ_seg, λ_static and λ_adv are set to 1, 1 and 0.01, respectively, during training.

The loss functions of the two discriminators, ℓ_{D_d} and ℓ_{D_n}, are defined as follows:

ℓ_{D_d} = (D_d(P_S) − l_s)² + (D_d(P_{T_d}) − l_t)², (16)
ℓ_{D_n} = (D_n(P_S) − l_s)² + (D_n(P_{T_n}) − l_t)², (17)

where l_t is the label for the target domains.
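The least-squares objectives of Eqs. (14), (16) and (17) can be sketched as follows, assuming softmaxed segmentation outputs pred_s, pred_td and pred_tn for the three domains and domain labels of 1 (source) and 0 (target), which are our assumptions.

```python
import torch


def lsgan_losses(d_day, d_night, pred_s, pred_td, pred_tn,
                 src_label=1.0, tgt_label=0.0):
    """Least-squares adversarial terms of Eqs. (14), (16), (17) (sketch)."""

    def ls(score_map, label):
        return torch.mean((score_map - label) ** 2)

    # Generator side: push target-domain predictions to look like source ones.
    l_adv = ls(d_day(pred_td), src_label) + ls(d_night(pred_tn), src_label)

    # Discriminator side: separate source predictions from target predictions.
    l_d_day = ls(d_day(pred_s.detach()), src_label) + ls(d_day(pred_td.detach()), tgt_label)
    l_d_night = ls(d_night(pred_s.detach()), src_label) + ls(d_night(pred_tn.detach()), tgt_label)
    return l_adv, l_d_day, l_d_night
```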

V Experiments

In this section, we first present the experimental testbeds and evaluation metrics. Then, we perform both unsupervised and supervised experiments to investigate the effectiveness of our method for nighttime driving-scene semantic segmentation. For the supervised experiments, we evaluate our approach on three datasets, Cityscapes [4], NightCity [34] and ACDC [30], which provide ground truth with pixel-level semantic annotations. For the unsupervised tests, we perform domain adaptation from Cityscapes (with labels) to Dark Zurich [29].

Fig. 5: Visual segmentation results of our method and the baseline model on NightCity_test images. All the methods are trained on Cityscapes and NightCity.

V-A Datasets and Evaluation Metrics

For all experiments, we employ the mean of category-wise intersection-over-union (mIoU) as the evaluation metric. The following datasets are used for model training and performance evaluation.
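For reference, a minimal mIoU computation from a pixel-level confusion matrix looks as follows (a sketch, not the official evaluation code of any benchmark).

```python
import numpy as np


def mean_iou(conf_matrix):
    """Category-wise IoU averaged over classes (mIoU).

    conf_matrix[i, j] counts pixels of ground-truth class i predicted as class j;
    classes absent from both prediction and ground truth would need to be excluded
    in a full evaluation protocol.
    """
    inter = np.diag(conf_matrix)
    union = conf_matrix.sum(axis=0) + conf_matrix.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)  # avoid division by zero
    return float(iou.mean())
```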

V-A1 Cityscapes [4]

Cityscapes is a semantic understanding dataset focused on daytime urban street scenes, which is widely used as a benchmark for segmentation tasks. It includes 19 categories of pixel-level annotations, and consists of 2,975 training images, 500 validation images and 1,525 testing images at a resolution of 2048×1024. In this work, we employ Cityscapes as the labeled daytime dataset in both the supervised and unsupervised experiments.

V-A2 NightCity [34]

NightCity is a large dataset of nighttime city driving scenes with pixel-level annotations, which can be used for supervised semantic segmentation. It provides 2,998 images for training and 1,299 images for validation or testing, with pixel-level annotations of 19 categories. The labeled object classes are the same as in Cityscapes [4].

V-A3 ACDC [30]

ACDC is an adverse-conditions dataset with correspondences for semantic driving-scene understanding. It contains 4,006 images with high-quality pixel-level semantic annotations, evenly distributed among four common adverse conditions in real-world driving environments, namely fog, nighttime, rain and snow: 1,000 fog images, 1,006 nighttime images, 1,000 rain images and 1,000 snow images. Both the resolution and the labeled categories are the same as those of Cityscapes [4]. We use ACDC_night as our supervised experimental dataset, which consists of 400 training, 106 validation and 500 test images.

V-A4 Dark Zurich [29]

Dark Zurich is a large dataset of urban driving scenes designed for unsupervised semantic segmentation. It includes 2,416 nighttime images, 2,920 twilight images and 3,041 daytime images for training, all of which are unlabeled. These images are captured in the same scenes during daytime, twilight and nighttime, so that they can be aligned by image features. In this work, we only employ the 2,416 night-day image pairs to train our unsupervised model. There are also 201 nighttime images with pixel-level annotations in the Dark Zurich dataset, comprising 50 images for validation (Dark Zurich-val) and 151 images for testing (Dark Zurich-test), which can be used for quantitative evaluation. The Dark Zurich-test set can only be evaluated through the official website; we obtain the mIoU result of our proposed approach on Dark Zurich-test by submitting the segmentation predictions to the online evaluation server.

V-A5 Nighttime Driving [5]

The Nighttime Driving dataset [5] includes 50 nighttime driving-scene images, all labeled with the same 19 classes as Cityscapes [4]. In this work, we adopt the Nighttime Driving dataset only for testing.

V-B Supervised Segmentation with DIAL-Filters

V-B1 Experimental Setup

We adopt several typical backbone networks, including DeepLabV2 [2], RefineNet [18] and PSPNet [52], to verify the generalization capability of DIAL-Filters. Following [44], all experiments use semantic segmentation models that are pre-trained on Cityscapes for 150,000 epochs. The mIoU scores of the pre-trained DeepLabV2, RefineNet and PSPNet on the Cityscapes validation set are 66.37%, 65.85% and 63.94%, respectively. During training, we employ random cropping at a scale between 0.5 and 1.0, and apply random horizontal flipping to augment the training data. As in [2, 44], we train our model using the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and weight decay. The learning rate is decreased from its initial value with the poly learning rate policy using a power of 0.9. The batch size is set to 4. We conducted our experiments on a Tesla V100 GPU, and our approach is implemented in PyTorch.
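For clarity, the poly policy simply scales the base learning rate as follows; the base value and iteration budget in the usage comment are placeholders, not the paper's settings.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Poly learning-rate policy: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


# Usage sketch (placeholder values):
# for group in optimizer.param_groups:
#     group["lr"] = poly_lr(2.5e-4, it, 150_000)
```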

Methods mIoU (%) on N_t mIoU (%) on C_v
C C + N C C + N
DeepLabV2 [2] 18.20 46.39 66.37 65.65
Ours (DeepLabV2) - 48.24 - 66.57
PSPNet [52] 20.65 47.29 63.94 65.87
Ours (PSPNet) - 49.73 - 66.59
RefineNet [18] 22.92 48.70 65.85 65.68
Ours (RefineNet) - 51.21 - 67.15
TABLE II: Comparison of our method and baseline models on the NightCity test set. “C”: Trained on Cityscapes. “C + N”: Trained on Cityscapes and NightCity. “N_t”: NightCity test set. “C_v”: Cityscapes validation set.
Fig. 6: Visual segmentation results of our method and the baseline model on ACDC_night_test images. All the methods are trained on Cityscapes and ACDC_night.
Fig. 7: An example of the processing pipeline of our method. For better illustration, the filtered results are normalized.

V-B2 Experiments on Cityscapes and NightCity Datasets

To demonstrate the effectiveness of our proposed method, we plug DIAL-Filters into three classic semantic segmentation networks and perform experiments on three labeled datasets. Table II reports the quantitative results of the existing methods and the proposed approach trained on Cityscapes ("C" columns) or on the hybrid datasets ("C+N" columns). When trained on the hybrid datasets (Cityscapes and NightCity) and evaluated on NightCity_test, our method outperforms DeepLabV2, PSPNet and RefineNet by 1.85%, 2.44% and 2.41%, respectively. Compared with these methods trained only on daytime Cityscapes, our method still improves them by 0.20%, 2.65% and 1.30% on the daytime Cityscapes validation set, while the baseline models trained on the hybrid data show less improvement or even degrade. This demonstrates that the IAPM module is able to adaptively process images with different illumination for better semantic segmentation. Figure 5 shows several visual examples of our method and the baseline PSPNet (trained on "C+N"). It can be observed that our method performs better on the categories that are often overlooked at night, such as pole and traffic sign.

V-B3 Experiments on Cityscapes and ACDC_night Datasets

We examine the effectiveness of the proposed method on the hybrid datasets of Cityscapes and ACDC_night. As depicted in Table III, the proposed DIAL-Filters with any of the three backbones perform better than the baseline models on the ACDC_night test set. Figure 6 shows qualitative comparisons between our method and the baseline RefineNet. It can be observed that the presented IAPM module is able to reveal more objects by adaptively increasing the brightness and contrast of the input image, which is essential for segmenting small objects. Figure 7 illustrates how the CNN-PP module predicts the DIF's parameters, including the detailed parameter values and the images processed by each sub-filter. After the input image is processed by the learned DIF module, more image details are revealed, which is conducive to the subsequent segmentation task.

Fig. 8: Qualitative comparisons of our approach with other methods on three samples from Dark Zurich-val. All the methods perform domain adaptation from Cityscapes to Dark Zurich.
Methods mIoU (%) on A_t mIoU (%) on C_v
C C + A C C + A
DeepLabV2 [2] 30.06 53.31 66.37 64.97
Ours(DeepLabV2) - 55.78 - 65.55
PSPNet [52] 26.62 56.69 63.94 65.18
Ours(PSPNet) - 58.42 - 66.75
RefineNet [18] 29.05 57.69 65.85 63.19
Ours(RefineNet) - 60.06 - 66.19
TABLE III: Comparison of our approach and baseline models on the ACDC_night test set. “C”: Trained on Cityscapes. “C + A”: Trained on Cityscapes and ACDC_night. “A_t”: ACDC_night test set. “C_v”: Cityscapes validation set.

V-C Unsupervised Segmentation with DIAL-Filters

V-C1 Experimental Setup

As in the supervised experiments, we employ DeepLabV2 [2], RefineNet [18] and PSPNet [52] as baseline models for the unsupervised segmentation experiments. The proposed model is trained with the Stochastic Gradient Descent (SGD) optimizer using a momentum of 0.9 and weight decay. Like [44], we employ the Adam optimizer to train the discriminators, with a separate learning rate for the discriminators. Moreover, we apply random cropping with a crop size of 512 at a scale between 0.5 and 1.0 for the Cityscapes dataset, and a crop size of 960 at a scale between 0.9 and 1.1 for the Dark Zurich dataset. In addition, random horizontal flipping is used during training. The other settings are consistent with the supervised experiments.

Method road sidewalk building wall fence pole traffic light traffic sign vegetation terrain sky person rider car truck bus train motorcycle bicycle mIoU
RefineNet [18]-Cityscapes 68.8 23.2 46.8 20.8 12.6 29.8 30.4 26.9 43.1 14.3 0.3 36.9 49.7 63.6 6.8 0.2 24.0 33.6 9.3 28.5
DeepLabV2 [2]-Cityscapes 79.0 21.8 53.0 13.3 11.2 22.5 20.2 22.1 43.5 10.4 18.0 37.4 33.8 64.1 6.4 0.0 52.3 30.4 7.4 28.8
PSPNet [52]-Cityscapes 78.2 19.0 51.2 15.5 10.6 30.3 28.9 22.0 56.7 13.3 20.8 38.2 21.8 52.1 1.6 0.0 53.2 23.2 10.7 28.8
AdaptSegNet-Cityscapes→DZ [37] 86.1 44.2 55.1 22.2 4.8 21.1 5.6 16.7 37.2 8.4 1.2 35.9 26.7 68.2 45.1 0.0 50.1 33.9 15.6 30.4
ADVENT-Cityscapes→DZ [38] 85.8 37.9 55.5 27.7 14.5 23.1 14.0 21.1 32.1 8.7 2.0 39.9 16.6 64.0 13.8 0.0 58.8 28.5 20.7 29.7
BDL-Cityscapes→DZ [17] 85.3 41.1 61.9 32.7 17.4 20.6 11.4 21.3 29.4 8.9 1.1 37.4 22.1 63.2 28.2 0.0 47.7 39.4 15.7 30.8
DMAda [5] 75.5 29.1 48.6 21.3 14.3 34.3 36.8 29.9 49.4 13.8 0.4 43.3 50.2 69.4 18.4 0.0 27.6 34.9 11.9 32.1
GCMA [28] 81.7 46.9 58.8 22.0 20.0 41.2 40.5 41.6 64.8 31.0 32.1 53.5 47.5 75.5 39.2 0.0 49.6 30.7 21.0 42.0
MGCDA [31] 80.3 49.3 66.2 7.8 11.0 41.4 38.9 39.0 64.1 18.0 55.8  52.1 53.5 74.7 66.0 0.0 37.5 29.1 22.7 42.5
DANNet (DeepLabV2) [44] 88.6 53.4 69.8 34.0 20.0 25.0 31.5 35.9 69.5  32.2 82.3 44.2 43.7 54.1 22.0 0.1 40.9 36.0 24.1 42.5
DANNet (RefineNet) [44] 90.0 54.0 74.8 41.0 21.1 25.0 26.8 30.2 72.0 26.2 84.0 47.0 33.9 68.2 19.0 0.3 66.4 38.3 23.6 44.3
DANNet (PSPNet) [44] 90.4 60.1 71.0 33.6 22.9 30.6 34.3 33.7 70.5 31.8 80.2 45.7 41.6 67.4 16.8 0.0 73.0 31.6 22.9 45.2
Ours (DeepLabV2) 88.7 55.8 69.8 34.7 17.1 31.7 26.6 34.4 69.0 25.9 80.1 45.1 43.3 67.6 10.9 1.1 66.1 37.6 20.5 43.5
Ours (RefineNet) 90.4 62.5 73.1 34.4 21.5 35.7 27.7 32.1 70.3 35.6 81.7 45.0 43.7 70.3 8.2 0.0 69.2 38.0 18.2 45.1
Ours (PSPNet) 90.6 60.8 70.9 40.2 21.1 39.6 34.4 38.3 73.2 30.2 72.9 48.6 41.6 72.8 8.8 0.0 74.6 33.0 22.8 46.0
TABLE IV: The per-category results on Dark Zurich-test by current state-of-the-art methods and our method. Cityscapes→DZ denotes the adaptation from Cityscapes to Dark Zurich-night. The best results are presented in BOLD.
Fig. 9: An example of the processing pipeline of our method. For better illustration, the filtered results are normalized.

V-C2 Comparison with State-of-the-art Methods

We compare the proposed approach with state-of-the-art unsupervised segmentation methods, including DANNet [44], MGCDA [31], GCMA [28], DMAda [5] and several domain adaptation methods [37, 38, 17], on Dark Zurich-test and Nighttime Driving to demonstrate the efficacy of our method. All competing methods adopt the ResNet-101 backbone. Specifically, both our method and DANNet are tested with three baseline models. MGCDA, GCMA and DMAda are built on RefineNet, while the rest are based on DeepLabV2.

Experimental Results on Dark Zurich-test. Table IV reports the quantitative results on the Dark Zurich-test dataset. Compared to the state-of-the-art nighttime segmentation methods, our proposed DIAL-Filters with PSPNet achieves the highest mIoU score. It is worth mentioning that, although our model is smaller, it outperforms DANNet with all three baseline models. It can also be seen that our DIAL-Filters with either DeepLabV2, RefineNet or PSPNet achieve better performance than the domain adaptation methods (see the second panel in Table IV). Figure 8 shows several visual comparisons among MGCDA, DANNet and our method. With the proposed DIAL-Filters, our adaptive module is able to distinguish objects of interest in the images, especially small objects and confusing areas with mixed categories in the dark. Figure 9 shows an example of how the CNN-PP module predicts the DIF's parameters, including the detailed parameter values and the images processed by each sub-filter. It can be observed that our proposed DIAL-Filters are able to increase the brightness of the input image and reveal image details, which is essential for segmenting nighttime images.

Experimental Results on Nighttime Driving. Table V reports the mIoU results on the Nighttime Driving test set. Compared with the state-of-the-art nighttime segmentation methods, our DIAL-Filters with PSPNet achieves the best performance. Though our model is smaller, it outperforms DANNet by 2.21%, 1.96% and 2.62% when RefineNet, DeepLabV2 and PSPNet are used as baselines, respectively. In addition, our method clearly achieves better performance than the domain adaptation methods.

Method mIoU
RefineNet [18]-Cityscapes 32.75
DeepLabV2 [2]-Cityscapes 25.44
PSPNet [52]-Cityscapes 27.65
AdaptSegNet-Cityscapes→DZ-night [37] 34.5
ADVENT-Cityscapes→DZ-night [38] 34.7
BDL-Cityscapes→DZ-night [17] 34.7
DMAda [5] 36.1
GCMA [28] 45.6
MGCDA [31] 49.4
DANNet (RefineNet) 42.36
DANNet (DeepLabV2) 44.98
DANNet (PSPNet) 47.70
Ours (RefineNet) 44.57
Ours (DeepLabV2) 46.94
Ours (PSPNet) 50.32
TABLE V: Comparison of our approach with the existing state-of-the-art methods on Nighttime Driving test set [5].

V-D Ablation Study

To examine the effectiveness of each module in our proposed framework, including IAPM, LGF and DIF, we conduct ablation experiments with different settings. All experiments are trained on the mixed datasets of Cityscapes and NightCity in a supervised manner, where the weight parameters are pre-trained for 150,000 epochs on Cityscapes.

Table VI shows the experimental results. We select RefineNet (ResNet-101) as the base model, and 'DIAL-Filters' denotes the full model of our method. The settings and training data are the same for all experiments. It can be seen that DIF preprocessing, LGF postprocessing and the image-adaptive IAPM all improve the segmentation performance. RefineNet_deep is a deeper version of RefineNet, whose ResNet-152 backbone has 15,644K more learnable parameters than ResNet-101. Our proposed approach performs better than RefineNet_deep with only 280K additional parameters in CNN-PP and LGF. The setting with fixed DIF means that the filters' hyperparameters are fixed to a set of given values, all within a reasonable range. Clearly, our DIAL-Filters achieve the best performance on both NightCity_test and Cityscapes_val, which indicates that our method can adaptively process both daytime and nighttime images. This is essential to the downstream segmentation tasks. Moreover, the LGF postprocessing further boosts the performance. Figure 10 shows the visual results with and without the LGF. It can be seen that the learnable guided filter obtains more precise segmentation boundaries for small objects.

Fig. 10: Visual results of w/ and w/o the LGF module on a sample from NightCity_test by our method (RefineNet).
Method mIoU (%) on N_t mIoU (%) on C_v
RefineNet 48.70 65.68
RefineNet_deep 49.80 66.43
w/o IAPM 48.86 66.33
w/o LGF 50.97 65.93
w/ fixed DIF 49.17 65.77
DIAL-Filters 51.21 67.15
TABLE VI: Ablation study on the variants of our DIAL-Filters (RefineNet) on NightCity_test and Cityscapes_val.
Method Additional Params Speed (ms)
RefineNet / 20
RefineNet_deep 15,644K 27
DANNet (RefineNet) 4,299K 23
Ours (RefineNet) 280K 24
TABLE VII: Efficiency analysis on the compared methods.

V-E Efficiency Analysis

In our proposed framework, we introduce a set of novel learnable DIAL-Filters with 280K trainable parameters into a segmentation network. CNN-PP has five convolutional layers, a dropout layer and a fully connected layer, and the LGF includes two convolutional layers. Based on RefineNet, Table VII compares the efficiency of several methods used in our experiments. All of them deploy an add-on module on top of RefineNet. The second column lists the number of additional parameters over the RefineNet model. The third column lists the running time on a color image using a single Tesla V100 GPU. It can be observed that our method adds only 280K trainable parameters over RefineNet while achieving the best performance in all experiments with comparable running time. Note that although our method has fewer trainable parameters than DANNet, its running time is slightly longer, because the filtering process in the DIF module incurs extra computation.

VI Conclusion

In this paper, we proposed a novel approach to semantic segmentation in nighttime driving conditions that adaptively enhances each input image to obtain better performance. Specifically, we introduced dual image-adaptive learnable filters (DIAL-Filters) and embedded them at the head and tail of a segmentation network. A fully differentiable image processing module was developed to preprocess the input image, whose hyperparameters are predicted by a small convolutional neural network. The preliminary segmentation results are further enhanced by learnable guided filtering for more accurate segmentation. The whole framework is trained in an end-to-end fashion, where the parameter prediction network is weakly supervised to learn an appropriate DIF module through the segmentation loss in the supervised experiments. Our experiments on both supervised and unsupervised segmentation demonstrated the superiority of the proposed DIAL-Filters over previous nighttime driving-scene semantic segmentation methods.

References

  • [1] Q. Bo, W. Ma, Y. Lai, and H. Zha (2022) All-higher-stages-in adaptive context aggregation for semantic edge detection. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §II-C.
  • [2] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI 40 (4), pp. 834–848. Cited by: §II-A, §II-A, §IV-A2, §V-B1, §V-C1, TABLE II, TABLE III, TABLE IV, TABLE V.
  • [3] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, pp. 801–818. Cited by: §II-A.
  • [4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: §I, §V-A1, §V-A2, §V-A3, §V-A5, §V.
  • [5] D. Dai and L. Van Gool (2018) Dark model adaptation: semantic image segmentation from daytime to nighttime. In IEEE Intelligent Transportation Systems Conference, pp. 3819–3824. Cited by: §I, §II-C, §V-A5, §V-A5, §V-C2, TABLE IV, TABLE V.
  • [6] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 764–773. Cited by: §II-C.
  • [7] X. Deng, P. Wang, X. Lian, and S. Newsam (2022) NightLab: a dual-level architecture with hardness detection for segmentation at night. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16938–16948. Cited by: §II-C.
  • [8] Y. Feng, X. Sun, W. Diao, J. Li, and X. Gao (2021) Double similarity distillation for semantic image segmentation. IEEE Trans. on Image Processing 30 (), pp. 5363–5376. Cited by: §II-A.
  • [9] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §I.
  • [10] K. He, J. Sun, and X. Tang (2012) Guided image filtering. IEEE transactions on pattern analysis and machine intelligence 35 (6), pp. 1397–1409. Cited by: §III-B, §III-B.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §I, §IV-A2.
  • [12] P. Hu, B. Shuai, J. Liu, and G. Wang (2017) Deep level sets for salient object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2300–2309. Cited by: §III-B.
  • [13] Y. Hu, H. He, C. Xu, B. Wang, and S. Lin (2018) Exposure: a white-box photo post-processing framework. ACM Transactions on Graphics (TOG) 37 (2), pp. 26. Cited by: §I, §II-B, §III-A1, §III-A1.
  • [14] Z. Huang, C. Wang, X. Wang, W. Liu, and J. Wang (2020) Semantic image segmentation by scale-adaptive networks. IEEE Trans. on Image Processing 29 (), pp. 2066–2077. Cited by: §II-A.
  • [15] J. Ji, R. Shi, S. Li, P. Chen, and Q. Miao (2020) Encoder-decoder with cascaded crfs for semantic segmentation. IEEE Transactions on Circuits and Systems for Video Technology 31 (5), pp. 1926–1938. Cited by: §I.
  • [16] C. Li and M. Wand (2016) Combining markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2479–2486. Cited by: §I.
  • [17] Y. Li, L. Yuan, and N. Vasconcelos (2019) Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6936–6945. Cited by: §V-C2, TABLE IV, TABLE V.
  • [18] G. Lin, A. Milan, C. Shen, and I. Reid (2017) Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1925–1934. Cited by: §I, §II-A, §II-A, §IV-A2, §V-B1, §V-C1, TABLE II, TABLE III, TABLE IV, TABLE V.
  • [19] K. Lin, L. Wang, K. Luo, Y. Chen, Z. Liu, and M. Sun (2020) Cross-domain complementary learning using pose for multi-person part segmentation. IEEE Transactions on Circuits and Systems for Video Technology 31 (3), pp. 1066–1078. Cited by: §II-C.
  • [20] W. Liu, G. Ren, R. Yu, S. Guo, J. Zhu, and L. Zhang (2022) Image-adaptive yolo for object detection in adverse weather conditions. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §I, §II-B.
  • [21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022. Cited by: §II-C.
  • [22] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §II-A.
  • [23] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley (2017) Least squares generative adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2794–2802. Cited by: §IV-B3.
  • [24] A. Mosleh, A. Sharma, E. Onzon, F. Mannan, N. Robidoux, and F. Heide (2020) Hardware-in-the-loop end-to-end optimization of camera image processing pipelines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7529–7538. Cited by: §III-A2.
  • [25] S. Nag, S. Adak, and S. Das (2019) What’s there in the dark. In International Conference on Image Processing, pp. 2996–3000. Cited by: §I, §II-C.
  • [26] A. Polesel, G. Ramponi, and V. J. Mathews (2000) Image enhancement via adaptive unsharp masking. IEEE Transactions on Image Processing 9 (3), pp. 505–510. Cited by: §II-B, §III-A1.
  • [27] E. Romera, L. M. Bergasa, K. Yang, J. M. Alvarez, and R. Barea (2019) Bridging the day and night domain gap for semantic segmentation. In 2019 IEEE Intelligent Vehicles Symposium, pp. 1312–1318. Cited by: §II-C.
  • [28] C. Sakaridis, D. Dai, and L. V. Gool (2019) Guided curriculum model adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7374–7383. Cited by: §I, §II-C, §V-C2, TABLE IV, TABLE V.
  • [29] C. Sakaridis, D. Dai, and L. Van Gool (2019) Guided curriculum model adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Cited by: §IV-B1, §V-A4, §V.
  • [30] C. Sakaridis, D. Dai, and L. Van Gool (2021) ACDC: the adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10765–10775. Cited by: §I, §V-A3, §V.
  • [31] C. Sakaridis, D. Dai, and L. Van Gool (2021) Map-guided curriculum domain adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence. Cited by: §I, §II-C, §V-C2, TABLE IV, TABLE V.
  • [32] L. Sun, K. Wang, K. Yang, and K. Xiang (2019) See clearer at night: towards robust nighttime semantic segmentation through day-night image conversion. In Artificial Intelligence and Machine Learning in Defense Applications, Vol. 11169, pp. 111690A. Cited by: §I, §II-C.
  • [33] X. Sun, C. Chen, X. Wang, J. Dong, H. Zhou, and S. Chen (2021) Gaussian dynamic convolution for efficient single-image segmentation. IEEE Transactions on Circuits and Systems for Video Technology 32 (5), pp. 2937–2948. Cited by: §I.
  • [34] X. Tan, K. Xu, Y. Cao, Y. Zhang, L. Ma, and R. W. Lau (2021) Night-time scene parsing with a large real dataset. IEEE Transactions on Image Processing 30, pp. 9085–9098. Cited by: §I, §V-A2, §V.
  • [35] D. Teso-Fz-Betoño, E. Zulueta, A. Sánchez-Chica, U. Fernandez-Gamiz, and A. Saenz-Aguirre (2020) Semantic segmentation to develop an indoor navigation system for an autonomous mobile robot. Mathematics 8 (5), pp. 855. Cited by: §I.
  • [36] Y. Tian and S. Zhu (2021) Partial domain adaptation on semantic segmentation. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §II-C.
  • [37] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7472–7481. Cited by: §IV-B2, §V-C2, TABLE IV, TABLE V.
  • [38] T. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez (2019) Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2517–2526. Cited by: §V-C2, TABLE IV, TABLE V.
  • [39] Q. Wang, J. Gao, and X. Li (2019) Weakly supervised adversarial domain adaptation for semantic segmentation in urban scenes. IEEE Trans. on Image Processing 28 (9), pp. 4376–4386. Cited by: §II-C.
  • [40] W. Wang, Z. Chen, X. Yuan, and F. Guan (2021) An adaptive weak light image enhancement method. In Twelfth International Conference on Signal Processing Systems, Vol. 11719, pp. 1171902. Cited by: §II-B.
  • [41] X. Weng, Y. Yan, S. Chen, J. Xue, and H. Wang (2021) Stage-aware feature alignment network for real-time semantic segmentation of street scenes. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §I.
  • [42] H. Wu, S. Zheng, J. Zhang, and K. Huang (2018) Fast end-to-end trainable guided filter. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1838–1847. Cited by: §III-B.
  • [43] T. Wu, S. Tang, R. Zhang, J. Cao, and Y. Zhang (2021) CGNet: a light-weight context guided network for semantic segmentation. IEEE Transactions on Image Processing 30 (), pp. 1169–1179. Cited by: §II-A.
  • [44] X. Wu, Z. Wu, H. Guo, L. Ju, and S. Wang (2021) DANNet: a one-stage domain adaptation network for unsupervised nighttime semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15769–15778. Cited by: §I, §II-C, §IV-A2, §IV-B1, §IV-B3, §V-B1, §V-C1, §V-C2, TABLE IV.
  • [45] X. Wu, Z. Wu, L. Ju, and S. Wang (2021) A one-stage domain adaptation network with image alignment for unsupervised nighttime semantic segmentation. IEEE Transactions on Pattern Analysis & Machine Intelligence (01), pp. 1–1. Cited by: §II-C.
  • [46] L. Yang, X. Liang, T. Wang, and E. Xing (2018) Real-to-virtual domain unification for end-to-end autonomous driving. In Proceedings of the European Conference on Computer Vision, pp. 530–545. Cited by: §I.
  • [47] R. Yu, W. Liu, Y. Zhang, Z. Qu, D. Zhao, and B. Zhang (2018) Deepexposure: learning to expose photos with asynchronously reinforced adversarial learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 2153–2163. Cited by: §I, §II-B.
  • [48] Z. Yu and C. Bajaj (2004) A fast and adaptive method for image contrast enhancement. In 2004 International Conference on Image Processing, 2004. ICIP’04., Vol. 2, pp. 1001–1004. Cited by: §II-B.
  • [49] H. Zeng, J. Cai, L. Li, Z. Cao, and L. Zhang (2020) Learning image-adaptive 3d lookup tables for high performance photo enhancement in real-time. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §I, §II-B, §III-A2.
  • [50] L. Zhang, P. Wang, W. Wei, H. Lu, C. Shen, A. van den Hengel, and Y. Zhang (2018) Unsupervised domain adaptation using robust class-wise matching. IEEE Transactions on Circuits and Systems for Video Technology 29 (5), pp. 1339–1349. Cited by: §II-C.
  • [51] L. Zhang (2015) Image adaptive edge detection based on canny operator and multiwavelet denoising. In 2014 International Conference on Computer Science and Electronic Technology, pp. 335–338. Cited by: §II-B.
  • [52] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §I, §II-A, §II-A, §IV-A2, §V-B1, §V-C1, TABLE II, TABLE III, TABLE IV, TABLE V.
  • [53] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2242–2251. Cited by: §II-C.