Semantic segmentation aims to assign a semantic category to each pixel, partitioning an image into regions of the same object class. As a fundamental task in computer vision, semantic segmentation is widely used in autonomous driving, indoor navigation and virtual reality. By taking advantage of the powerful feature representations learned by convolutional neural networks, deep learning-based semantic segmentation methods [11, 18, 52, 41, 15, 33] have achieved encouraging results on conventional daytime datasets [9, 4]. However, these methods generalize poorly to adverse nighttime lighting, which is critical for real-world applications such as autonomous driving. In this work, we focus on semantic segmentation in nighttime driving scenarios.
There are two main challenges for nighttime driving-scene segmentation. One is the difficulty of obtaining large-scale labeled nighttime datasets due to the poor visibility at night. To this end, several nighttime datasets have been developed recently [34, 30]. NightCity contains 2,998 labeled nighttime driving-scene images and ACDC_Night has 400 images, which can be used for supervised training. The other challenge is the exposure imbalance and motion blur in nighttime images, which are difficult for existing daytime segmentation methods to handle. To tackle these challenges, some domain adaptation methods have been proposed to transfer semantic segmentation models from daytime to nighttime without using labels in the nighttime domain. The domain adaptation network (DANNet) employs adversarial learning for nighttime semantic segmentation and adds an image relighting subnetwork before the segmentation network. This introduces a large number of additional training parameters, which is not conducive to deployment. In [5, 28], the twilight domain is treated as a bridge to achieve domain adaptation from daytime to nighttime. Moreover, some methods [28, 32, 25, 31] take an image transfer model as a pre-processing stage to stylize nighttime or daytime images so as to construct synthetic datasets. By involving complicated image transfer networks between day and night, these methods are usually computationally intensive. In particular, it is difficult for the image transfer networks to achieve the ideal transformation when the inter-domain gap is large.
The nighttime images captured in driving scenes often contain both over-exposed and under-exposed parts, which seriously degrade visual appearance and structure. Figure 1(a) shows an example nighttime image with both over-exposed (street lights and car headlights) and under-exposed (background and trees) regions. Such uneven brightness deteriorates the image content and texture, making it difficult to accurately segment object boundaries. In digital imaging systems, retouching experts improve image quality by tuning the hyperparameters of an image enhancement module, including white balance adjustment, gamma correction, exposure compensation, detail enhancement, tone mapping, etc. To avoid manually tuning these parameters, "white-box" image-adaptive enhancement frameworks [13, 47, 49] have been employed to improve image quality.
To address the above issues, we propose an effective driving-scene semantic segmentation method that improves performance via dual image-adaptive learnable filters (DIAL-Filters), comprising an image-adaptive processing module (IAPM) and a learnable guided filter (LGF) module. Specifically, we present a set of fully differentiable image filters (DIF) in the IAPM module, whose hyperparameters are adaptively predicted by a small CNN-based parameter predictor (CNN-PP) according to the brightness, contrast and exposure information of the input image. Moreover, the LGF is suggested to enhance the output of the segmentation network. A joint optimization scheme is introduced to learn the DIF, CNN-PP, segmentation network and LGF in an end-to-end manner. Additionally, we make use of both daytime and nighttime images to train the proposed network. By taking advantage of the CNN-PP network, our method is able to adaptively deal with images under different lighting conditions. Figure 1 shows an example segmentation process of our proposed approach.
Part of the above-mentioned image-adaptive filtering techniques was used for the detection task in our previous conference paper. Compared to that work, we make the following new contributions: 1) we extend the image-adaptive filtering methods to the nighttime segmentation task and achieve state-of-the-art results; 2) a learnable guided filter is proposed to improve the segmentation performance on object edge regions; 3) we develop both supervised and unsupervised segmentation frameworks.
The main contributions of this paper are threefold:
We propose a novel lightweight add-on module, called DIAL-Filters, which can be easily added to the existing models. It is able to significantly improve the segmentation performance on nighttime images by double enhancement before and after the segmentation network.
We train our image-adaptive segmentation model in an end-to-end manner, which ensures that CNN-PP can learn an appropriate DIF to enhance the image for segmentation and learn a LGF to preserve edges and details.
The supervised experiments show that the proposed method can significantly improve segmentation performance on ACDC_Night and NightCity datasets. The unsupervised experiments on Dark Zurich and Nighttime Driving testbeds show that our method achieves state-of-the-art performance for unsupervised nighttime semantic segmentation.
II Related Work
II-A Semantic Segmentation
Image semantic segmentation is essential to many visual understanding systems, and its performance on benchmark datasets has been greatly improved by the development of Convolutional Neural Networks (CNNs). FCN was considered a milestone, demonstrating the capability of training a deep network for semantic segmentation in an end-to-end manner on variable-size images. Multi-level-based methods [18, 52] employ multi-scale analysis to extract the global context while preserving low-level details, with a convolutional layer generating the final per-pixel predictions. DeepLab and its variants [2, 3] introduced atrous convolution and Atrous Spatial Pyramid Pooling to the segmentation network. A Scale-Adaptive Network was proposed to deal with objects of different scales. Some approaches [8, 43] either made use of knowledge distillation or designed small networks to trade off accuracy against inference speed.
All the above methods focus on segmentation in daytime conditions. In this paper, we pay attention to night-time scenes. To investigate the effectiveness of our proposed DIAL-Filters on nighttime driving-scene segmentation, we select three popular and widely used segmentation networks as baselines, including RefineNet , PSPNet  and DeepLabV2 .
II-B Image Adaptation
Image adaptation is widely used in both low-level and high-level tasks. For image enhancement, some traditional methods [26, 48, 40] adaptively calculate the parameters of an image transformation according to the corresponding image features. Wang et al. proposed a brightness adjustment function that adaptively tunes the enhancement parameters based on the illumination distribution of an input image. The methods in [13, 47, 49] employ a small CNN to flexibly learn the hyperparameters of the image transformation. Yu et al. utilized a small CNN to learn image-adaptive exposures with deep reinforcement learning and adversarial learning. Hu et al. proposed a post-processing framework with a set of differentiable filters, where deep reinforcement learning (DRL) is used to generate the image operations and filter parameters according to the quality of the retouched image. For the high-level detection task, Zhang et al. presented an improved Canny edge detector that uses the mean gradient of the entire image to adaptively select the dual thresholds. IA-YOLO proposed a light CNN to adaptively predict the filters' parameters for better detection performance. Inspired by these methods, we adopt image adaptation for segmentation in nighttime driving scenarios.
II-C Nighttime Driving-scene Semantic Segmentation
While most existing works focus on "normal", well-illuminated scenes, some works address challenging scenarios such as nighttime scenes. Domain adaptation methods [36, 50, 19, 1] have achieved encouraging performance in many tasks, such as classification, object detection, pedestrian re-identification and segmentation. Thus, some researchers employed domain adaptation-based methods [5, 39, 28, 31] to transfer models trained on normal scenes to the target domain. A progressive adaptation approach was proposed to transfer from daytime to nighttime via the bridge of twilight time. Sakaridis et al. [28, 31] presented a guided curriculum adaptation method based on DMAda, which gradually adapts segmentation models from day to night using both annotated synthetic images and unlabeled real images. However, the additional segmentation models for different domains in these gradual adaptation methods increase the computation cost significantly. Some studies [27, 32, 25] trained additional style transfer networks, e.g., CycleGAN, to perform day-to-night or night-to-day image transfer before training the semantic segmentation models. The disadvantage of these methods is that the performance of the subsequent segmentation network is highly dependent on the preceding style transfer model.
Recently, Deng et al. proposed NightLab, which includes a hardness detection module to find hard regions and combines the Swin Transformer and deformable convolution (DeformConv) for nighttime segmentation. However, NightLab is time-consuming since it trains multiple deep models, e.g., a detection network, a regularized light adaptation module, the Swin Transformer and the DeformConv network. Wu et al. [44, 45] proposed an unsupervised one-stage adaptation method, where an image relighting network is placed at the head of the segmentation network, and adversarial learning is employed to achieve domain alignment between labeled daytime data and unlabeled nighttime data. Unfortunately, the additional relighting network incurs a large number of parameters and much computation.
In contrast to the above methods, we propose an image-adaptive approach to nighttime segmentation by embedding the proposed DIAL-Filters into a segmentation network. Our method can also be trained for unsupervised domain adaptation with an adversarial loss, and demonstrates significant advantages in both performance and efficiency.
III Dual Image-adaptive Learnable Filters
Driving-scene images captured at nighttime have poor visibility due to weak illumination, which leads to difficulties in semantic segmentation. Since each image may contain both overexposed and underexposed regions, the key to alleviating the difficulty of nighttime segmentation is to deal with such exposure differences. Therefore, we propose a set of dual image-adaptive learnable filters (DIAL-Filters) to enhance the results before and after the segmentation network. As illustrated in Figure 2, the whole pipeline consists of an image-adaptive processing module (IAPM), a segmentation network and a learnable guided filter (LGF). The IAPM module includes a CNN-based parameter predictor (CNN-PP) and a set of differentiable image filters (DIF).
III-A Image-adaptive Processing Module
III-A1 Differentiable Image Filters
As in prior work, the design of the image filters should follow the principles of differentiability and resolution independence. For the gradient-based optimization of CNN-PP, the filters must be differentiable in order to allow network training by backpropagation. Since a CNN may consume intensive computational resources when processing high-resolution images, we learn the filter parameters from a downsampled low-resolution version of the input image. The same filters are then applied to the image at its original resolution, so that the filters are independent of the image resolution.
Our proposed DIF consists of several differentiable filters with adjustable hyperparameters, including Exposure, Gamma, Contrast and Sharpen. As in , the standard color operators, such as Gamma, Exposure and Contrast, can be expressed as pixel-wise filters.
Pixel-wise Filters. The pixel-wise filters map an input pixel value $P_i = (r_i, g_i, b_i)$ into an output pixel value $P_o = (r_o, g_o, b_o)$, where $r$, $g$ and $b$ represent the values of the red, green and blue color channels, respectively. The mapping functions of the three pixel-wise filters are listed in Table I, where the second column lists the parameters to be optimized in our approach. Exposure and Gamma are simple multiplication and power transformations, respectively. Obviously, these mapping functions are differentiable with respect to both the input image and their parameters.
| Filter | Parameters | Mapping Function |
| Exposure | $E$: exposure value | $P_o = P_i \cdot 2^{E}$ |
| Gamma | $G$: gamma value | $P_o = P_i^{G}$ |
| Contrast | $\alpha$: contrast value | $P_o = \alpha \cdot En(P_i) + (1-\alpha) \cdot P_i$ |
The differentiable contrast filter is designed with an input parameter $\alpha$ that sets the linear interpolation between the original image and the fully enhanced image. As shown in Table I, the enhanced image $En(P_i)$ in the mapping function is defined as follows:

$$En(P_i) = P_i \times \frac{EnLum(P_i)}{Lum(P_i)},$$

where $Lum(P_i) = 0.27\, r_i + 0.67\, g_i + 0.06\, b_i$ is the luminance of $P_i$, and $EnLum(P_i) = \frac{1}{2}\left(1 - \cos\left(\pi \times Lum(P_i)\right)\right)$ is the enhanced luminance.
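As a concrete illustration, the three pixel-wise filters can be written as element-wise NumPy operations. This is a minimal sketch assuming the exposure, gamma and contrast mappings described above; the luma weights and cosine enhancement curve are assumptions for illustration rather than a definitive implementation.

```python
import numpy as np

# Minimal sketch of the three pixel-wise filters. Images are float arrays
# in [0, 1] with channels ordered R, G, B. The exact mappings below are
# illustrative assumptions.

def exposure_filter(img, E):
    """Scale brightness by 2^E, where E is the learnable exposure value."""
    return np.clip(img * (2.0 ** E), 0.0, 1.0)

def gamma_filter(img, G):
    """Power-law transform with the learnable gamma value G."""
    return np.clip(img, 1e-6, 1.0) ** G

def _luminance(img):
    # Luma weights for R, G, B (an assumption for illustration).
    return 0.27 * img[..., 0] + 0.67 * img[..., 1] + 0.06 * img[..., 2]

def contrast_filter(img, alpha):
    """Linearly interpolate between the input and a fully enhanced image."""
    lum = _luminance(img)
    en_lum = 0.5 * (1.0 - np.cos(np.pi * lum))      # enhanced luminance
    enhanced = img * (en_lum / np.maximum(lum, 1e-6))[..., None]
    return np.clip(alpha * enhanced + (1.0 - alpha) * img, 0.0, 1.0)
```

Because every operation is element-wise and smooth, an autograd framework can backpropagate through these mappings to the parameter predictor.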
Sharpen Filter. Image sharpening can highlight image details. Like the unsharp masking technique, the sharpening process can be described as follows:

$$F(x, \lambda) = I(x) + \lambda \left(I(x) - Gau(I(x))\right),$$

where $I(x)$ is the input image, $Gau(I(x))$ denotes Gaussian filtering of the input image, and $\lambda$ is a positive scaling factor. This sharpening operation is differentiable with respect to both $x$ and $\lambda$. Note that the sharpening degree can be tuned for better segmentation performance by optimizing $\lambda$.
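A minimal sketch of this unsharp-mask style sharpening, using a simple separable Gaussian blur; the kernel radius and sigma are illustrative assumptions.

```python
import numpy as np

# Sharpen filter sketch: F(x, lambda) = I(x) + lambda * (I(x) - Gau(I(x))).
# The separable Gaussian kernel (radius, sigma) is an assumption.

def gaussian_blur(img, sigma=1.0, radius=2):
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    # Separable convolution along height, then width (edge padding).
    pad = np.pad(img, ((radius, radius), (0, 0), (0, 0)), mode="edge")
    img = sum(k[i] * pad[i:i + img.shape[0]] for i in range(2 * radius + 1))
    pad = np.pad(img, ((0, 0), (radius, radius), (0, 0)), mode="edge")
    return sum(k[i] * pad[:, i:i + img.shape[1]] for i in range(2 * radius + 1))

def sharpen_filter(img, lam):
    """lam > 0 is the learnable scaling factor controlling sharpening degree."""
    return img + lam * (img - gaussian_blur(img))
```

The operation is differentiable in both the image and lam, so lam can be optimized jointly with the rest of the pipeline.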
III-A2 CNN-based Parameters Predictor (CNN-PP)
In the camera image signal processing (ISP) pipeline, adjustable filters are usually employed for image enhancement, and their hyperparameters are manually tuned by experienced engineers through visual inspection. Such a tuning process is awkward and expensive when suitable parameters must be found for a broad range of scenes. To address this limitation, we employ a small CNN as a parameter predictor to estimate the hyperparameters, which is very efficient.
The purpose of CNN-PP is to predict the DIF's parameters by understanding the global content of the image, such as brightness, color and tone, as well as the degree of exposure. A downsampled image is sufficient to estimate such information, which greatly saves computational cost. As in prior work, we apply the small CNN-PP to a low-resolution version of the input image to predict the hyperparameters of DIF. Given an input image of any resolution, we simply use bilinear interpolation to downsample it to a fixed low resolution. As shown in Figure 2, the CNN-PP network is composed of five convolutional blocks, one dropout layer with a rate of 0.5 and a fully-connected layer. Each convolutional block includes a convolutional layer with stride 2 and a Leaky ReLU. The final fully-connected layer outputs the hyperparameters for the DIF module. The output channels of the five convolutional layers are 16, 32, 64, 128 and 128, respectively. The CNN-PP model contains only 278K parameters when the total number of DIF hyperparameters is 4.
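To make the architecture concrete, the following walk-through computes the CNN-PP feature-map sizes. The 256×256 input resolution, 3×3 kernels and padding of 1 are assumptions for illustration; the text above specifies only the stride-2 convolutions and the channel widths.

```python
# Walk-through of CNN-PP feature-map sizes. Input resolution, kernel size
# and padding are assumptions; stride 2 and channel widths follow the text.

def conv_out(size, kernel=3, stride=2, pad=1):
    # Standard convolution output-size formula.
    return (size + 2 * pad - kernel) // stride + 1

def cnn_pp_sizes(in_size=256, channels=(16, 32, 64, 128, 128)):
    sizes = []
    size = in_size
    for c in channels:
        size = conv_out(size)     # each stride-2 block halves the spatial size
        sizes.append((c, size))
    return sizes  # a dropout + FC layer then map this to the 4 DIF parameters
```

With these assumptions, the spatial resolution halves at each block (256 → 128 → 64 → 32 → 16 → 8), keeping the predictor small.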
III-B Learnable Guided Filter
Many recent approaches to high-level visual tasks cascade a guided filter behind their original architecture to improve the results [12, 42]. The guided filter is an edge-preserving and gradient-preserving image operation, which makes use of the object boundaries in the guidance image to detect object saliency. It is able to suppress the saliency outside the objects, improving the downstream detection or segmentation performance.
The original guided filter takes a guidance map $I$, an input image $p$, and produces an output image $q$. As in Eq. (5), it assumes that $q$ is a linear transformation of $I$ in a window $\omega_k$ centered at pixel $k$:

$$q_i = a_k I_i + b_k, \quad \forall i \in \omega_k,$$

where $(a_k, b_k)$ are linear coefficients assumed to be constant in $\omega_k$. The final solution of $(a_k, b_k)$ is given by:

$$a_k = \frac{\frac{1}{|\omega|}\sum_{i \in \omega_k} I_i p_i - \mu_k \bar{p}_k}{\sigma_k^2 + \epsilon}, \qquad b_k = \bar{p}_k - a_k \mu_k,$$

where $\mu_k$ and $\sigma_k^2$ are the mean and variance of $I$ in the window $\omega_k$, $|\omega|$ is the number of pixels in $\omega_k$, $\epsilon$ is a regularization parameter, and $\bar{p}_k$ is the mean of $p$ in $\omega_k$. Applying the linear transformation to each window $\omega_k$ and averaging all the possible values of $q_i$, as shown in Eq. (8), yields the filtering output:

$$q_i = \bar{a}_i I_i + \bar{b}_i,$$

where $\bar{a}_i$ and $\bar{b}_i$ are the average coefficients of all windows overlapping pixel $i$. To further enhance the segmentation results, we introduce a learnable guided filter behind the segmentation network. Figure 3 illustrates its architecture. The input $p$ is the output of the segmentation network, and the guidance map $I$ is produced by a small guidance network, which involves two convolutional layers with 64 and 19 output channels and contains only 1,491 parameters. The LGF module is trained along with the other modules in an end-to-end manner, which allows LGF to adaptively process each image for better segmentation performance with edge preservation.
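The classic (non-learnable) guided filter that LGF builds on can be sketched as follows for a single-channel input. This is the standard He et al. formulation, not the learnable variant; the naive box filter is for clarity only.

```python
import numpy as np

# Classic guided filter sketch. I: single-channel guidance image,
# p: filtering input (e.g., one channel of the segmentation logits),
# r: window radius, eps: regularization parameter.

def box_filter(x, r):
    """Naive mean over a (2r+1)x(2r+1) window (integral images in practice)."""
    H, W = x.shape
    out = np.empty_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - r), min(H, i + r + 1)
            j0, j1 = max(0, j - r), min(W, j + r + 1)
            out[i, j] = x[i0:i1, j0:j1].mean()
    return out

def guided_filter(I, p, r=2, eps=1e-3):
    mu_I, mu_p = box_filter(I, r), box_filter(p, r)
    var_I = box_filter(I * I, r) - mu_I * mu_I
    cov_Ip = box_filter(I * p, r) - mu_I * mu_p
    a = cov_Ip / (var_I + eps)      # per-window linear coefficient a_k
    b = mu_p - a * mu_I             # per-window offset b_k
    # Average the coefficients of all windows overlapping each pixel.
    return box_filter(a, r) * I + box_filter(b, r)
```

In the learnable variant, the guidance map is produced by a small trainable network, and the whole operation remains differentiable, so it can be trained jointly with the segmentation network.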
IV Nighttime Semantic Segmentation
The proposed DIAL-Filters are added to a segmentation network to form our nighttime segmentation method. As shown in Figure 2, we plug the IAPM and LGF into the head and the tail of the segmentation network, respectively. Most existing methods adopt unsupervised domain adaptation to deal with nighttime segmentation. To make a more comprehensive comparison, we propose both supervised and unsupervised segmentation frameworks based on DIAL-Filters in this paper.
IV-A Supervised Segmentation with DIAL-Filters
As illustrated in Figure 2, our supervised nighttime segmentation method consists of an IAPM module, a segmentation network and an LGF module. The IAPM module includes a CNN-based parameter predictor (CNN-PP) and a set of differentiable image filters (DIF). We first downsample the input image to a low resolution and feed it into CNN-PP to predict the DIF's parameters. Then, the full-resolution image filtered by DIF is taken as the input to the segmentation network. The preliminary segmentation map is filtered by LGF to obtain the final segmentation result. The whole pipeline is trained end-to-end with the segmentation loss, so that CNN-PP learns an appropriate DIF to adaptively enhance each image for better semantic segmentation.
IV-A2 Segmentation Network
IV-A3 Re-weighting and Segmentation Loss
Since the numbers of pixels for different object categories in driving-scene images are highly imbalanced, it is difficult for the network to learn the features of small-object categories, which leads to poor performance in predicting their pixels. We use a re-weighting scheme to increase the network's attention to small-size objects. The re-weighting equation is as follows:

$$w_c = -\log(a_c),$$

where $a_c$ represents the proportion of pixels annotated as category $c$ in the labeled Cityscapes dataset. Obviously, the lower the value of $a_c$, the higher the assigned weight. Therefore, it facilitates the network to segment the categories of smaller-size objects. The weight is normalized as follows:

$$w'_c = \frac{w_c - \mathrm{mean}(w)}{\mathrm{std}(w)} \times std + mean,$$

where $\mathrm{mean}(w)$ and $\mathrm{std}(w)$ are the mean and standard deviation of $w$, respectively, and $std$ and $mean$ are set to fixed default values during training.
We utilize the popular weighted cross-entropy loss for segmentation:

$$\ell_{seg} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} w'_c\, y_i^{(c)} \log p_i^{(c)},$$

where $p^{(c)}$ is the $c$-th channel of the segmentation result and $w'_c$ is the weight set in Eq. (10). $N$ is the number of valid pixels in the corresponding labeled segmentation image, $C$ is the number of labeled categories in the Cityscapes dataset, and $y^{(c)}$ denotes the one-hot encoding of the ground truth of the $c$-th category.
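The re-weighting scheme and the weighted cross-entropy loss can be sketched as below. The negative-log re-weighting and the normalization follow the description above, while the target mean/std values are placeholders, not the paper's defaults.

```python
import numpy as np

# Sketch of class re-weighting and weighted cross-entropy. a[c] is the pixel
# proportion of class c; target_std/target_mean are placeholder values.

def class_weights(a, target_std=0.05, target_mean=1.0):
    w = -np.log(np.asarray(a, dtype=float))   # rarer class -> larger weight
    w = (w - w.mean()) / w.std()              # standardize
    return w * target_std + target_mean       # rescale to the target statistics

def weighted_cross_entropy(probs, onehot, w):
    """probs, onehot: (N, C) arrays; w: (C,). Mean over the N valid pixels."""
    return -np.mean(np.sum(w * onehot * np.log(probs + 1e-8), axis=1))
```

Rarer categories (smaller proportion) receive larger weights, steering the network toward small-object classes such as pole and traffic sign.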
IV-B Unsupervised Segmentation with DIAL-Filters
Dark Zurich is a relatively comprehensive nighttime dataset for real-world driving scenarios, which contains corresponding images of the same driving scenes at daytime, twilight and nighttime. There are three image domains in our unsupervised method: a source domain $S$ and two target domains $T_d$ and $T_n$, where $S$, $T_d$ and $T_n$ denote Cityscapes (daytime), Dark Zurich-D (daytime) and Dark Zurich-N (nighttime), respectively. As shown in Figure 4, our unsupervised nighttime segmentation framework employs a similar overall architecture to prior work. The proposed framework consists of three training circuits, which perform domain adaptation from the labeled source domain $S$ to the two target domains $T_d$ and $T_n$ through the weight-sharing IAPM module, segmentation network and LGF module. It is worth mentioning that only the images in Cityscapes have semantic labels during training.
Following prior work, we design discriminators to distinguish whether the segmentation results are from the source domain or the target domains by applying adversarial learning. Specifically, there are two discriminators with the same structure in our model. Each involves five convolutional blocks, and each block includes a 4 × 4 convolution layer with a Leaky ReLU. The stride of the first two convolution layers is 2, and that of the rest is 1. The two discriminators are trained to distinguish whether the output is from $S$ or $T_d$ and from $S$ or $T_n$, respectively.
IV-B3 Objective Functions
When training the proposed end-to-end unsupervised framework, we use a total loss for the generator and the corresponding adversarial losses for the discriminators. The total loss consists of a segmentation loss $\ell_{seg}$, a static loss $\ell_{static}$ and an adversarial loss $\ell_{adv}$.
Segmentation Loss: As in Eq. (11), we take the weighted cross-entropy loss as the segmentation loss. In particular, in our unsupervised framework, only the annotated source-domain images are used to optimize this loss. We use the same re-weighting settings during the unsupervised training process.
Static Loss: Considering the similarities between the daytime images in Dark Zurich-D and their corresponding nighttime images in Dark Zurich-N, we employ a static loss for the target-domain nighttime images. It provides pseudo pixel-level supervision for the static object categories, e.g., road, sidewalk, wall, vegetation, terrain and sky.
We first define $p_d$ as the target-domain daytime segmentation result and $p_n$ as the corresponding nighttime segmentation prediction. When calculating the static loss, we only pay attention to the channels corresponding to the static categories. Thus, we obtain $p_d^s$ and $p_n^s$, where $s$ indexes the $C_s$ categories of static objects. We then obtain the re-weighted daytime segmentation result as the pseudo label $\hat{y}$ by Eq. (10). Finally, the static loss is defined as below:

$$\ell_{static} = -\frac{1}{N}\sum_{i=1}^{N} \log(o_i),$$

where $N$ is the number of valid pixels in the corresponding labeled segmentation map and $o$ denotes the likelihood map of the correct category, which is defined as follows:

$$o_i = \max_{j \in w(i)} \mathrm{onehot}(\hat{y}_j)^{\top} p_{n,i}^{s}.$$

The operation $\mathrm{onehot}(\cdot)$ represents the one-hot encoding of the semantic pseudo ground truth $\hat{y}$, and $j$ is each position of the window $w(i)$ centered at pixel $i$.
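The static loss can be sketched as below. The window-matching step is one plausible reading of the description (for each pixel, taking the best agreement between the nighttime prediction and the one-hot pseudo label within a small window, to tolerate day-night misalignment), so this is an interpretation rather than the exact operation.

```python
import numpy as np

# Hedged sketch of the static loss over the static-category channels only.
# p_night: (H, W, C_static) softmax probabilities for the nighttime image;
# pseudo_day: (H, W) pseudo-label class ids from the daytime prediction.

def static_loss(p_night, pseudo_day, radius=1, C_static=6):
    H, W, _ = p_night.shape
    onehot = np.eye(C_static)[pseudo_day]          # (H, W, C_static)
    o = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            window = onehot[i0:i1, j0:j1]           # candidate pseudo labels
            # Best agreement between prediction and pseudo label in the window.
            o[i, j] = np.max(np.sum(window * p_night[i, j], axis=-1))
    return -np.mean(np.log(o + 1e-8))
```

When the nighttime prediction agrees perfectly with the pseudo label, the likelihood map is 1 everywhere and the loss vanishes.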
Generative adversarial training is widely used to align two domains. In this case, we use two discriminators to distinguish whether the segmentation prediction is from the source domain or a target domain, and employ the least-squares loss function in our adversarial training. The adversarial loss is defined by:

$$\ell_{adv} = \mathbb{E}_{x \in T}\left[(D(p_x) - s)^2\right],$$

where $s$ is the label for the source domain. Finally, we define the total loss of the generator ($G$) as follows:

$$\ell_{G} = \lambda_{seg}\,\ell_{seg} + \lambda_{static}\,\ell_{static} + \lambda_{adv}\,\ell_{adv},$$

where $\lambda_{seg}$, $\lambda_{static}$ and $\lambda_{adv}$ are set to 1, 1 and 0.01, respectively, during training.
The loss functions of the two discriminators $D_d$ and $D_n$ are defined as follows:

$$\ell_{D} = \mathbb{E}\left[(D(p_S) - s)^2\right] + \mathbb{E}\left[(D(p_T) - t)^2\right],$$

where $t$ is the label for the target domains.
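The least-squares adversarial losses and the weighted generator objective can be sketched as below; the 0/1 choice of domain labels is an assumption for illustration.

```python
import numpy as np

# Least-squares (LSGAN-style) adversarial losses. d_src / d_tgt are
# discriminator outputs on source / target segmentation predictions;
# s and t are the source and target domain labels (0/1 is an assumption).

def adv_loss_generator(d_tgt, s=1.0):
    # Push target-domain predictions to look like source-domain ones.
    return np.mean((d_tgt - s) ** 2)

def adv_loss_discriminator(d_src, d_tgt, s=1.0, t=0.0):
    return np.mean((d_src - s) ** 2) + np.mean((d_tgt - t) ** 2)

def total_generator_loss(l_seg, l_static, l_adv,
                         w_seg=1.0, w_static=1.0, w_adv=0.01):
    # Loss weights follow the text: 1, 1 and 0.01.
    return w_seg * l_seg + w_static * l_static + w_adv * l_adv
```

The small adversarial weight keeps the segmentation and static terms dominant while the discriminators gradually align the two domains.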
V Experiments
In this section, we first present the experimental testbeds and evaluation metrics. Then, we perform both unsupervised and supervised experiments to investigate the effectiveness of our method for nighttime driving-scene semantic segmentation. For the supervised experiments, we evaluate our approach on three datasets, including Cityscapes, NightCity and ACDC, which provide ground truth with pixel-level semantic annotations. For the unsupervised tests, we perform domain adaptation from Cityscapes (with labels) to Dark Zurich.
V-A Datasets and Evaluation Metrics
For all experiments, we employ the mean of category-wise intersection-over-union (mIoU) as the evaluation metric. The following datasets are used for model training and performance evaluation.
V-A1 Cityscapes 
Cityscapes is a semantic understanding dataset focused on daytime urban street scenes, which is widely used as a benchmark for segmentation tasks. It includes pixel-level annotations of 19 categories, and consists of 2,975 training images, 500 validation images and 1,525 testing images. In this work, we employ Cityscapes as the labeled daytime dataset in both the supervised and unsupervised experiments.
V-A2 NightCity 
NightCity is a large dataset of nighttime city driving scenes with pixel-level annotations, which can be used for supervised semantic segmentation. There are 2,998 images for training and 1,299 images for validation or testing, with pixel-level annotations of 19 categories. The labeled object classes are the same as those in Cityscapes.
V-A3 ACDC
ACDC is an adverse-conditions dataset with correspondences, designed for semantic driving-scene understanding. It contains 4,006 images with high-quality pixel-level semantic annotations, evenly distributed among four common adverse conditions in real-world driving environments, namely fog, nighttime, rain and snow: 1,000 fog images, 1,006 nighttime images, 1,000 rain images and 1,000 snow images. Both the resolution and the labeled categories are the same as in Cityscapes. We use ACDC_night as our supervised experimental dataset, which consists of 400 training, 106 validation and 500 test images.
V-A4 Dark Zurich 
Dark Zurich is a large dataset of urban driving scenes designed for unsupervised semantic segmentation. It includes 2,416 nighttime images, 2,920 twilight images and 3,041 daytime images for training, all unlabeled. These images are captured in the same scenes during daytime, twilight and nighttime, so that they can be aligned by image features. In this work, we only employ the 2,416 night-day image pairs to train our unsupervised model. There are also 201 nighttime images with pixel-level annotations in the Dark Zurich dataset, comprising 50 images for validation (Dark Zurich-val) and 151 images for testing (Dark Zurich-test), which can be used for quantitative evaluation. Dark Zurich-test provides only an online evaluation channel via the official website; we obtain the mIoU of our proposed approach on Dark Zurich-test by submitting the segmentation predictions to the evaluation website.
V-A5 Nighttime Driving
Nighttime Driving is a test set of 50 coarsely annotated nighttime driving-scene images, which we use only for evaluating unsupervised nighttime segmentation.
V-B Supervised Segmentation with DIAL-Filters
V-B1 Experimental Setup
Following common practice, all experiments utilize semantic segmentation models that are pre-trained on Cityscapes for 150,000 epochs. The mIoU of the pre-trained DeepLabV2, RefineNet and PSPNet on the Cityscapes validation set is 66.37, 65.85 and 63.94, respectively. During training, we employ random cropping on a scale between 0.5 and 1.0, and apply random horizontal flipping to augment the training data. As in [2, 44], we train our model using the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and weight decay. The initial learning rate is decreased with the poly learning rate policy using a power of 0.9. The batch size is set to 4. We conduct our experiments on a Tesla V100 GPU, and our approach is implemented in PyTorch.
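The poly learning-rate policy follows the standard formula lr = base_lr · (1 − iter / max_iter)^power; a minimal sketch (the base learning rate used below is illustrative, since the paper's value is not reproduced here):

```python
# Poly learning-rate schedule with power 0.9, as used during training.
# base_lr is illustrative; it decays smoothly to 0 at max_iter.

def poly_lr(base_lr, it, max_iter, power=0.9):
    return base_lr * (1.0 - it / max_iter) ** power
```

The schedule starts at the base rate and decays to zero at the final iteration, which is the common setting for DeepLab-style training.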
[Table II. mIoU (%) on NightCity_test (N_t) and the Cityscapes validation set (C_v), for models trained on Cityscapes only (C) and on the hybrid datasets (C + N).]
V-B2 Experiments on Cityscapes and NightCity Datasets
To demonstrate the effectiveness of our proposed method, we plug DIAL-Filters into three classic semantic segmentation networks and perform experiments on three labeled datasets. Table II reports the quantitative results of the existing methods and the proposed approach trained on Cityscapes ("C" columns) or on hybrid datasets ("C+N" columns). When trained on the hybrid datasets (Cityscapes and NightCity) and validated on NightCity_test, our method outperforms DeepLabV2, PSPNet and RefineNet by 1.85%, 2.44% and 2.41%, respectively. Compared with these methods trained on daytime Cityscapes, our method still improves them by 0.20%, 2.65% and 1.30% on the daytime Cityscapes validation set, while the baseline models trained on hybrid data show less improvement or even degrade. This demonstrates that the IAPM module is able to adaptively process images with different illumination for better semantic segmentation. Figure 5 shows several visual examples of our method and the baseline PSPNet (trained on "C+N"). It can be observed that our method better segments the categories that are overlooked by other methods at night, such as pole and traffic sign.
V-B3 Experiments on Cityscapes and ACDC_night Datasets
We examine the effectiveness of the proposed method on the hybrid datasets of Cityscapes and ACDC_night. As depicted in Table III, the proposed DIAL-Filters with any of the three backbones perform better than the baseline models on the ACDC_night test set. Figure 6 shows qualitative comparisons between our method and the baseline RefineNet. It can be observed that the presented IAPM module is able to reveal more objects by adaptively increasing the brightness and contrast of the input image, which is essential for segmenting regions of small objects. Figure 7 illustrates how the CNN-PP module predicts the DIF's parameters, including the detailed parameter values and the images processed by each sub-filter. After the input image is processed by the learned DIF module, more image details are revealed, which benefits the subsequent segmentation task.
[Table III. mIoU (%) on the ACDC_night test set (A_t) and the Cityscapes validation set (C_v), for models trained on Cityscapes only (C) and on the hybrid datasets (C + A).]
V-C Unsupervised Segmentation with DIAL-Filters
V-C1 Experimental Setup
As in the supervised experiments, we employ DeepLabV2, RefineNet and PSPNet as baseline models to perform the unsupervised segmentation experiments. The proposed model is trained by the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and weight decay. Like the previous work, we employ the Adam optimizer to train the discriminators, whose learning rate is set separately from that of the generator. Moreover, we apply random cropping with a crop size of 512 on a scale between 0.5 and 1.0 for the Cityscapes dataset, and a crop size of 960 on a scale between 0.9 and 1.1 for the Dark Zurich dataset. In addition, random horizontal flipping is used during training. The other settings are consistent with the supervised experiments.
| Method | road | sidewalk | building | wall | fence | pole | light | sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle | mIoU |
| DANNet (DeepLabV2) | 88.6 | 53.4 | 69.8 | 34.0 | 20.0 | 25.0 | 31.5 | 35.9 | 69.5 | 32.2 | 82.3 | 44.2 | 43.7 | 54.1 | 22.0 | 0.1 | 40.9 | 36.0 | 24.1 | 42.5 |
| DANNet (RefineNet) | 90.0 | 54.0 | 74.8 | 41.0 | 21.1 | 25.0 | 26.8 | 30.2 | 72.0 | 26.2 | 84.0 | 47.0 | 33.9 | 68.2 | 19.0 | 0.3 | 66.4 | 38.3 | 23.6 | 44.3 |
| DANNet (PSPNet) | 90.4 | 60.1 | 71.0 | 33.6 | 22.9 | 30.6 | 34.3 | 33.7 | 70.5 | 31.8 | 80.2 | 45.7 | 41.6 | 67.4 | 16.8 | 0.0 | 73.0 | 31.6 | 22.9 | 45.2 |
V-C2 Comparison with state-of-the-art methods
We compare the proposed approach with state-of-the-art unsupervised segmentation methods, including DANNet, MGCDA, GCMA, DMAda and several domain adaptation methods [37, 38, 17], on Dark Zurich-test and Nighttime Driving to demonstrate the efficacy of our method. All these competing methods adopt the ResNet-101 backbone. Specifically, both our method and DANNet are tested with the three baseline models. MGCDA, GCMA and DMAda are tested with the RefineNet baseline, while the rest are based on DeepLabV2.
Experimental Results on Dark Zurich-test. Table IV reports the quantitative results on the Dark Zurich-test dataset. Compared with the state-of-the-art nighttime segmentation methods, our proposed DIAL-Filters with PSPNet achieves the highest mIoU score. It is worth mentioning that although our model is smaller, it outperforms DANNet on all three baseline models. Moreover, our DIAL-Filters with either DeepLabV2, RefineNet or PSPNet achieves better performance than the domain adaptation methods (see the second panel in Table IV). Figure 8 shows several visual comparisons of MGCDA, DANNet and our method. With the proposed DIAL-Filters, our adaptive module is able to distinguish objects of interest in the images, especially small objects and confusing areas with mixed categories in the dark. Figure 9 shows an example of how the CNN-PP module predicts the DIF's parameters, including the detailed parameter values and the images processed by each sub-filter. It can be observed that our proposed DIAL-Filters are able to increase the brightness of the input image and reveal image details, which is essential for segmenting nighttime images.
Experimental Results on Night Driving. Table V reports the mIoU results on the Night Driving test dataset. Compared with the state-of-the-art nighttime segmentation methods, our DIAL-Filters with PSPNet achieves the best performance. Though our model is smaller, it outperforms DANNet by 2.21%, 1.96% and 2.62% when RefineNet, DeepLabV2 and PSPNet are used as baselines, respectively. In addition, our method clearly outperforms the domain adaptation methods.
V-D Ablation Study
To examine the effectiveness of each module in our proposed framework, including IAPM, LGF and DIF, we conduct ablation experiments with different settings. All experiments are trained on the mixed dataset of Cityscapes and NightCity in a supervised manner, where the weight parameters are pre-trained for 150,000 epochs on Cityscapes.
Table VI shows the experimental results. We select RefineNet (ResNet-101) as the base model, and 'DIAL-Filters' denotes the full model of our method. The settings and training data are the same for all experiments. It can be seen that DIF preprocessing, LGF postprocessing and the image-adaptive IAPM all improve the segmentation performance. RefineNet_deep is a deeper version of RefineNet, whose ResNet-152 backbone has 15,644K more learnable parameters than ResNet-101. Our proposed approach outperforms RefineNet_deep with only 280K additional parameters in CNN-PP and LGF. The method with fixed DIF means that the filter's hyperparameters are set to fixed values, all within a reasonable range. Clearly, our DIAL-Filters method achieves the best performance on both NightCity_test and Cityscapes_test, which indicates that our method can adaptively process both daytime and nighttime images. This is essential for the downstream segmentation tasks. Moreover, the LGF postprocessing further boosts the performance. Figure 10 shows the visual results with and without LGF. It can be seen that the learnable guided filter obtains more precise segmentation boundaries for small objects.
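The boundary-refinement behaviour of LGF builds on the classic guided filter of He et al. [cited in §III-B], which smooths a source map while preserving the edges of a guidance image. A minimal NumPy sketch of the non-learnable version is given below; in the paper's LGF the corresponding behaviour is made learnable, whereas here the radius `r` and regularizer `eps` are fixed hand-set hyperparameters:

```python
import numpy as np

def box_filter(x, r):
    """Mean filter over a (2r+1)x(2r+1) window via 2-D cumulative sums
    (edge-padded, so the output has the same shape as the input)."""
    pad = np.pad(x, r, mode='edge')
    c = np.cumsum(np.cumsum(pad, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))          # zero row/col for window sums
    h, w = x.shape
    s = c[2*r+1:, 2*r+1:] - c[:h, 2*r+1:] - c[2*r+1:, :w] + c[:h, :w]
    return s / (2*r + 1) ** 2

def guided_filter(guide, src, r=2, eps=1e-2):
    """Classic guided filter: output is locally a linear transform of
    `guide`, so edges in the guidance image survive in the refined `src`
    (e.g. a coarse per-class score map refined by the input image)."""
    mean_I, mean_p = box_filter(guide, r), box_filter(src, r)
    corr_Ip = box_filter(guide * src, r)
    var_I = box_filter(guide * guide, r) - mean_I ** 2
    a = (corr_Ip - mean_I * mean_p) / (var_I + eps)   # local slope
    b = mean_p - a * mean_I                            # local offset
    return box_filter(a, r) * guide + box_filter(b, r)
```

Refining each class's score map with the input image as guidance in this way is what sharpens object boundaries; the learnable variant lets the network tune this behaviour end-to-end through the segmentation loss.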
|Method||mIoU (%) on N_t||mIoU (%) on C_v|
|w/ fixed DIF||49.17||65.77|
|Method||Additional Params||Speed (ms)|
V-E Efficiency Analysis
In our proposed framework, we introduce a set of novel learnable DIAL-Filters with 280K trainable parameters into a segmentation network. CNN-PP consists of five convolutional layers, a dropout layer and a fully connected layer, and LGF includes two convolutional layers. Based on RefineNet, Table VII compares the efficiency of several methods used in our experiments. All these methods deploy an add-on module into RefineNet. The second column lists the number of additional parameters over the RefineNet model, and the third column lists the running time on a color image using a single Tesla V100 GPU. It can be observed that our method only adds 280K trainable parameters over RefineNet while achieving the best performance in all experiments with comparable running time. Note that although our method has fewer trainable parameters than DANNet, its running time is slightly longer, because the filtering process in the DIF module incurs extra computation.
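To see why such an add-on module stays small, it helps to count conv-layer parameters directly. The sketch below uses purely illustrative channel widths and output dimensions (they are not the paper's exact CNN-PP configuration); even so, a five-conv-plus-FC predictor of this shape stays far below the 280K total, let alone a ResNet-101 backbone:

```python
def conv2d_params(c_in, c_out, k):
    """Learnable parameters of a conv layer with bias: k*k*c_in*c_out + c_out."""
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    """Learnable parameters of a fully connected layer with bias."""
    return n_in * n_out + n_out

# Hypothetical CNN-PP layout: five 3x3 conv layers on a downsampled input,
# then a fully connected layer predicting the DIF hyperparameters
# (here 15 scalars after global pooling -- an assumed, illustrative choice).
channels = [3, 16, 32, 32, 32, 32]
cnn_pp = sum(conv2d_params(ci, co, 3) for ci, co in zip(channels, channels[1:]))
cnn_pp += fc_params(32, 15)
print(cnn_pp)  # a few tens of thousands of parameters
```

Because the predictor operates on a heavily downsampled input and outputs only a handful of scalars, its cost is dominated by the fixed conv stack rather than the image resolution, which is what keeps the add-on overhead modest.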
VI Conclusion
In this paper, we proposed a novel approach to semantic segmentation in nighttime driving conditions by adaptively enhancing each input image to obtain better performance. Specifically, we introduced dual image-adaptive learnable filters (DIAL-Filters) and embedded them into the head and end of a segmentation network. A fully differentiable image processing module was developed to preprocess the input image, whose hyperparameters were predicted by a small convolutional neural network. The preliminary segmentation results were further enhanced by learnable guided filtering for more accurate segmentation. The whole framework was trained in an end-to-end fashion, where the parameter prediction network was weakly supervised to learn an appropriate DIF module through the segmentation loss in the supervised experiments. Our experiments on both supervised and unsupervised segmentation demonstrated the superiority of the proposed DIAL-Filters over previous nighttime driving-scene semantic segmentation methods.
-  (2022) All-higher-stages-in adaptive context aggregation for semantic edge detection. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §II-C.
-  (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI 40 (4), pp. 834–848. Cited by: §II-A, §II-A, §IV-A2, §V-B1, §V-C1, TABLE II, TABLE III, TABLE IV, TABLE V.
-  (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, pp. 801–818. Cited by: §II-A.
-  (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Cited by: §I, §V-A1, §V-A2, §V-A3, §V-A5, §V.
-  (2018) Dark model adaptation: semantic image segmentation from daytime to nighttime. In IEEE Intelligent Transportation Systems Conference, pp. 3819–3824. Cited by: §I, §II-C, §V-A5, §V-A5, §V-C2, TABLE IV, TABLE V.
-  (2017) Deformable convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 764–773. Cited by: §II-C.
-  (2022) NightLab: a dual-level architecture with hardness detection for segmentation at night. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16938–16948. Cited by: §II-C.
-  (2021) Double similarity distillation for semantic image segmentation. IEEE Trans. on Image Processing 30 (), pp. 5363–5376. Cited by: §II-A.
-  (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §I.
-  (2012) Guided image filtering. IEEE transactions on pattern analysis and machine intelligence 35 (6), pp. 1397–1409. Cited by: §III-B, §III-B.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §I, §IV-A2.
-  (2017) Deep level sets for salient object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2300–2309. Cited by: §III-B.
-  (2018) Exposure: a white-box photo post-processing framework. ACM Transactions on Graphics (TOG) 37 (2), pp. 26. Cited by: §I, §II-B, §III-A1, §III-A1.
-  (2020) Semantic image segmentation by scale-adaptive networks. IEEE Trans. on Image Processing 29 (), pp. 2066–2077. Cited by: §II-A.
-  (2020) Encoder-decoder with cascaded crfs for semantic segmentation. IEEE Transactions on Circuits and Systems for Video Technology 31 (5), pp. 1926–1938. Cited by: §I.
-  (2016) Combining markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2479–2486. Cited by: §I.
-  (2019) Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6936–6945. Cited by: §V-C2, TABLE IV, TABLE V.
-  (2017) Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1925–1934. Cited by: §I, §II-A, §II-A, §IV-A2, §V-B1, §V-C1, TABLE II, TABLE III, TABLE IV, TABLE V.
-  (2020) Cross-domain complementary learning using pose for multi-person part segmentation. IEEE Transactions on Circuits and Systems for Video Technology 31 (3), pp. 1066–1078. Cited by: §II-C.
-  (2022) Image-adaptive yolo for object detection in adverse weather conditions. In Proceedings of the AAAI Conference on Artificial Intelligence. Cited by: §I, §II-B.
-  (2021) Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022. Cited by: §II-C.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §II-A.
-  (2017) Least squares generative adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2794–2802. Cited by: §IV-B3.
-  (2020) Hardware-in-the-loop end-to-end optimization of camera image processing pipelines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7529–7538. Cited by: §III-A2.
-  (2019) What’s there in the dark. In International Conference on Image Processing, pp. 2996–3000. Cited by: §I, §II-C.
-  (2000) Image enhancement via adaptive unsharp masking. IEEE Transactions on Image Processing 9 (3), pp. 505–510. Cited by: §II-B, §III-A1.
-  (2019) Bridging the day and night domain gap for semantic segmentation. In 2019 IEEE Intelligent Vehicles Symposium, pp. 1312–1318. Cited by: §II-C.
-  (2019) Guided curriculum model adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7374–7383. Cited by: §I, §II-C, §V-C2, TABLE IV, TABLE V.
-  (2019) Guided curriculum model adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Cited by: §IV-B1, §V-A4, §V.
-  (2021) ACDC: the adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10765–10775. Cited by: §I, §V-A3, §V.
-  (2021) Map-guided curriculum domain adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence. Cited by: §I, §II-C, §V-C2, TABLE IV, TABLE V.
-  (2019) See clearer at night: towards robust nighttime semantic segmentation through day-night image conversion. In Artificial Intelligence and Machine Learning in Defense Applications, Vol. 11169, pp. 111690A. Cited by: §I, §II-C.
-  (2021) Gaussian dynamic convolution for efficient single-image segmentation. IEEE Transactions on Circuits and Systems for Video Technology 32 (5), pp. 2937–2948. Cited by: §I.
-  (2021) Night-time scene parsing with a large real dataset. IEEE Transactions on Image Processing 30, pp. 9085–9098. Cited by: §I, §V-A2, §V.
-  (2020) Semantic segmentation to develop an indoor navigation system for an autonomous mobile robot. Mathematics 8 (5), pp. 855. Cited by: §I.
-  (2021) Partial domain adaptation on semantic segmentation. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §II-C.
-  (2018) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7472–7481. Cited by: §IV-B2, §V-C2, TABLE IV, TABLE V.
-  (2019) Advent: adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2517–2526. Cited by: §V-C2, TABLE IV, TABLE V.
-  (2019) Weakly supervised adversarial domain adaptation for semantic segmentation in urban scenes. IEEE Trans. on Image Processing 28 (9), pp. 4376–4386. Cited by: §II-C.
-  (2021) An adaptive weak light image enhancement method. In Twelfth International Conference on Signal Processing Systems, Vol. 11719, pp. 1171902. Cited by: §II-B.
-  (2021) Stage-aware feature alignment network for real-time semantic segmentation of street scenes. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §I.
-  (2018) Fast end-to-end trainable guided filter. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1838–1847. Cited by: §III-B.
-  (2021) CGNet: a light-weight context guided network for semantic segmentation. IEEE Transactions on Image Processing 30 (), pp. 1169–1179. Cited by: §II-A.
-  (2021) DANNet: a one-stage domain adaptation network for unsupervised nighttime semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15769–15778. Cited by: §I, §II-C, §IV-A2, §IV-B1, §IV-B3, §V-B1, §V-C1, §V-C2, TABLE IV.
-  (2021) A one-stage domain adaptation network with image alignment for unsupervised nighttime semantic segmentation. IEEE Transactions on Pattern Analysis & Machine Intelligence (01), pp. 1–1. Cited by: §II-C.
-  (2018) Real-to-virtual domain unification for end-to-end autonomous driving. In Proceedings of the European Conference on Computer Vision, pp. 530–545. Cited by: §I.
-  (2018) Deepexposure: learning to expose photos with asynchronously reinforced adversarial learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 2153–2163. Cited by: §I, §II-B.
-  (2004) A fast and adaptive method for image contrast enhancement. In 2004 International Conference on Image Processing, 2004. ICIP’04., Vol. 2, pp. 1001–1004. Cited by: §II-B.
-  (2020) Learning image-adaptive 3d lookup tables for high performance photo enhancement in real-time. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §I, §II-B, §III-A2.
-  (2018) Unsupervised domain adaptation using robust class-wise matching. IEEE Transactions on Circuits and Systems for Video Technology 29 (5), pp. 1339–1349. Cited by: §II-C.
-  (2015) Image adaptive edge detection based on canny operator and multiwavelet denoising. In 2014 International Conference on Computer Science and Electronic Technology, pp. 335–338. Cited by: §II-B.
-  (2017) Pyramid scene parsing network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §I, §II-A, §II-A, §IV-A2, §V-B1, §V-C1, TABLE II, TABLE III, TABLE IV, TABLE V.
-  (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2242–2251. Cited by: §II-C.