Multi-Task Learning for Automotive Foggy Scene Understanding via Domain Adaptation to an Illumination-Invariant Representation

Naif Alshammari, et al., 09/17/2019

Joint scene understanding and segmentation for automotive applications is a challenging problem in two key aspects: (1) classifying every pixel in the entire scene and (2) performing this task under unstable weather and illumination changes (e.g. foggy weather), which result in poor outdoor scene visibility. This poor outdoor scene visibility leads to non-optimal performance of deep convolutional neural network-based scene understanding and segmentation. In this paper, we propose an efficient end-to-end contemporary automotive semantic scene understanding approach under foggy weather conditions, employing domain adaptation and illumination-invariant image pre-transformation. As a multi-task pipeline, our proposed model provides: (1) transfer of images from extreme to clear-weather conditions using a domain transfer approach and (2) semantic segmentation of a scene using a competitive encoder-decoder convolutional neural network (CNN) with dense connectivity, skip connections and fusion-based techniques. We evaluate our approach on challenging foggy datasets, including a synthetic dataset (Foggy Cityscapes) as well as real-world datasets (Foggy Zurich and Foggy Driving). By incorporating RGB, depth, and illumination-invariant information, our approach outperforms the state-of-the-art in automotive scene understanding under foggy weather conditions.


I Introduction

Scene understanding and pixel-wise segmentation is an active research topic requiring robust image pixel classification. However, the performance of many state-of-the-art scene understanding algorithms is limited to clear weather conditions, such that extreme weather and illumination variation can lead to inaccurate scene classification and segmentation [9, 24, 2, 37, 34, 6]. To date, little attention has been paid to automotive scene understanding under extreme weather conditions (foggy weather being one example) [31, 7], with most deep learning approaches applicable to ideal weather conditions only. This paper introduces an efficient algorithm that tackles this challenge via a novel multi-task learning approach, which translates foggy scene images into clear-weather scenes and utilizes depth and luminance images for superior semantic segmentation performance.

Fig. 1: Exemplar prediction results of the proposed approach. FS: Foggy Scene [31]; GCS: Generated Clear Scene using [42]; IIT: Illumination-Invariant Transform using [24]; along with the corresponding semantic segmentation outputs (ours).
Fig. 2: Overview of our approach using [42, 18]. The source domain (foggy scene) is mapped to the target domain (clear scene). Subsequently, the generated clear-scene image is fed to the RGB encoder of the semantic segmentation network. Depth (D) and luminance (L) images complement the RGB input via the DL encoder. Finally, the outputs of the two encoders are fused and passed to the semantic segmentation decoder.

Previous attempts to tackle scene understanding under non-ideal weather conditions, including shadow removal and illumination reduction [24, 2, 37, 34], haze removal and scene defogging [14, 25, 38], and foggy scene understanding [31, 7], are mostly based on conventional methods. Despite the general trend of performance improvement within the automotive context [41, 4, 15, 23], there is still room for further improvement via more advanced deep neural networks. In parallel with recent techniques used in image segmentation [17, 13, 35, 18], employing the concepts of image-to-image translation to map from one domain to another [42, 19] can provide a colour transform that enables accurate semantic segmentation performance under extreme weather conditions.

In this work, we propose an efficient end-to-end automotive semantic scene understanding approach via multi-task learning, comprising both domain adaptation and segmentation network architectures. These architectures benefit from domain transformation, illumination-invariant image pre-transformation, and depth and luminance information to achieve superior scene understanding and segmentation performance. As an effective technique to avoid information loss and to share high-resolution features with the later reconstruction stages during up-sampling within a CNN, we use skip connections [29, 26, 39, 35]. In addition, we use feature fusion as an integration method within the overall model construction. To assess the impact on semantic segmentation performance, extensive experiments are also conducted using different invariant transformations as initial pre-processing.

II Related Work

The literature review is organized into three main categories: (1) semantic segmentation (Section II-A), (2) domain transfer (Section II-B), and (3) illumination-invariant and perceptual colour space computation (Section II-C).

II-A Semantic Segmentation

Modern segmentation techniques utilize deep convolutional neural networks and outperform traditional approaches by a large margin [3, 4, 15, 41, 23]. These contributions use large datasets such as ImageNet [30] for pre-trained models. Recent segmentation techniques are distinguished by their design choices, such as: (1) the network topology, for instance using pooling indices [3], skip connections [29], multi-path refinement [23], pyramid pooling [41], fusion-based architectures [13] and dense connectivity [17]; (2) the use of alternative inputs, such as depth as an extra channel (RGB-D) [13, 16], incorporating depth and luminance [18], or illumination invariance [1]; and (3) whether or not adverse weather conditions are considered [31, 7]. As the main objective of this work is semantic segmentation under foggy weather conditions, recent studies in this latter domain are specifically presented.

Foggy Scene Segmentation: Although studies have recognized the issue of foggy weather within scene understanding, research within the literature remains limited. One approach, SFSU [31], presents a semi-supervised method adapting [23, 40] to perform scene understanding of foggy scenes using synthetic data. By generating Foggy Cityscapes [31] (a partially synthetic dataset discussed in Section IV) through adding synthetic fog to real images of the well-known Cityscapes dataset [5], the approach of [31] overcomes the high cost of gathering and annotating data under extreme weather conditions. A supervised step is first conducted to improve semantic segmentation performance; this supervised learning is subsequently combined with an unsupervised technique by augmenting clear-weather images with their synthetic fog counterparts. In another study, CMAda [7] proposes an adaptive semantic segmentation model trained from light synthetic fog to real dense fog. Using a fog simulator, CMAda generates foggy datasets by adding synthetic fog to real images. Like [31], CMAda [7] is based on the RefineNet architecture [23] for semantic segmentation.

II-B Domain Transfer

Transferring an image from its real domain to another, differing domain allows multiple uses of images taken in complex environments or generated in different forms. Recent advances in image style transfer [10] generate a target image by capturing the style texture information of the input image using the Gram matrix. Work by [22] shows that image style transfer (from a source domain to a target domain) is the process of minimizing the difference between the source and target distributions. Recent methods [19, 42, 33] use Generative Adversarial Networks (GAN) [12] to learn the mapping from source to target images. Based on training over a large dataset for a specific image style, [42] shows an efficient approach to transfer the style of one image onto another. Within the context of our work, we take advantage of [42] to improve semantic segmentation by generating target scenes (clear-weather scenes) from the source domain (foggy scene images), hence significantly increasing our available training image data.

II-C Illumination-Invariant Images

An illumination-invariant image is a single-channel image calculated by combining the three RGB colour channels such that scene colour variations due to varying lighting conditions are removed (or minimized). Inspired by [9], numerous illumination-invariant image representation techniques [9, 24, 2, 37, 34, 6, 20] have been proposed in the literature. Mainly used for shadow removal in outdoor scenes, illumination-invariant pre-processing provides better scene classification and understanding by reducing illumination variations [1]. In this work, we take advantage of the image illumination-invariant transformation proposed in [24], used as an alternative input to CNN-based scene understanding.

III Proposed Approach

Our main objective is to train an end-to-end network that semantically labels every pixel in the scene and is invariant to weather conditions and illumination variations. In general, our approach consists of two sub-components, namely domain transfer and semantic segmentation (each of which can also function as an independent unit). These sub-components produce two separate outputs: a generated image (from fog to clear weather) and semantic pixel labels. The pipeline of our approach is shown in Fig. 2. In this section, we discuss the details of the two sub-components: the Domain Transfer Model and the Semantic Segmentation Model.

III-A Domain Transfer Model

Our goal is to learn a mapping from the source domain (foggy scenes) to the target domain (clear weather), for which we assume the scene visibility level in the generated image is the optimal input to the Semantic Segmentation Model (Section III-B). We use the generative adversarial networks proposed in [42] to generate the target images used later for the semantic segmentation task. A generator $G$ (generating clear-scene samples $G(x)$) and a discriminator $D_Y$ (discriminating between generated samples $G(x)$ and real target samples $y$) perform the mapping from the source domain $X$ to the target domain $Y$. The adversarial loss for each generator $G$ with its discriminator $D_Y$ is calculated as follows:

$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))]$    (1)

where $p_{data}$ is the data distribution, $X$ the source domain with samples $x$ and $Y$ the target domain with samples $y$.
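As an illustration only, the adversarial objective of Eq. (1) can be written in PyTorch roughly as below. This is a minimal sketch assuming the standard log-likelihood GAN form used in [42]; `generator_g` and `discriminator_y` are placeholder modules, and the full CycleGAN objective additionally includes the reverse mapping and a cycle-consistency term not shown here.

```python
import torch
import torch.nn as nn

def adversarial_loss(discriminator_y, generator_g, real_x, real_y):
    """Sketch of the adversarial loss of Eq. (1) for G: X (foggy) -> Y (clear)."""
    bce = nn.BCEWithLogitsLoss()

    # E_{y ~ p_data(y)} [log D_Y(y)] : real clear-weather samples scored as real (label 1)
    pred_real = discriminator_y(real_y)
    loss_real = bce(pred_real, torch.ones_like(pred_real))

    # E_{x ~ p_data(x)} [log(1 - D_Y(G(x)))] : translated foggy samples scored as fake (label 0)
    fake_y = generator_g(real_x)
    pred_fake = discriminator_y(fake_y.detach())   # detach: no gradient into G for this term
    loss_fake = bce(pred_fake, torch.zeros_like(pred_fake))

    return loss_real + loss_fake
```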

III-B Semantic Segmentation Model

As the subsequent component of the overall model, our pipeline performs semantic segmentation on the generated images (target-domain samples $G(x)$), incorporated with depth (D) and luminance (L), via the Semantic Segmentation Model (shown in Fig. 3). Motivated by [18], we use an encoder-decoder model for automotive semantic segmentation. The network design mainly consists of two encoders, an RGB encoder and a depth-with-luminance (DL) encoder, which downsample the input images, and a decoder, which upsamples the feature maps back to the original input dimension. In addition, dense connections and fused feature maps are implemented on top of the baseline architecture.

Fig. 3: Details of the segmentation network which consists of two encoders taking two types of inputs: RGB Image and DL Image (with Depth and Luminance channels).
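For orientation, the overall wiring of Fig. 3 can be sketched as follows. The names `rgb_encoder`, `dl_encoder` and `decoder` are illustrative placeholders for the components detailed in the remainder of this section, not the exact implementation.

```python
import torch.nn as nn

class FoggySceneSegNet(nn.Module):
    """Illustrative wiring of the two-encoder / one-decoder design in Fig. 3."""

    def __init__(self, rgb_encoder, dl_encoder, decoder):
        super().__init__()
        self.rgb_encoder = rgb_encoder   # consumes the 3-channel RGB (or generated clear) image
        self.dl_encoder = dl_encoder     # consumes the 2-channel depth + luminance image
        self.decoder = decoder           # upsamples fused features back to input resolution

    def forward(self, rgb, dl):
        rgb_feats = self.rgb_encoder(rgb)
        dl_feats = self.dl_encoder(dl)
        fused = rgb_feats + dl_feats     # element-wise sum fusion of matching feature maps
        return self.decoder(fused)       # per-pixel class scores (19 classes)
```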

RGB encoder: Designed to process the three-channel RGB input, the RGB encoder comprises three downsampler blocks with convolutional and max pooling layers, each followed by batch normalization and a ReLU activation function (with {16, 64, 128} output channels respectively). Subsequently, five non-bottleneck modules are implemented using factorized convolutions (a 2D convolution kernel factorized into two 1D kernels), each followed by batch normalization and ReLU with residual connections. Eight further non-bottleneck modules, with dilated and factorized convolutions, form the last component of the RGB encoder.
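A minimal sketch of such a factorized-convolution residual (non-bottleneck) module is given below; it assumes the common factorization of a square kernel into a vertical and a horizontal 1D kernel, and the exact kernel sizes, dilation rates and layer ordering of the baseline design are only approximated here.

```python
import torch.nn as nn

class NonBottleneck1D(nn.Module):
    """Residual module with factorized (1D) convolutions, as used in the encoders.

    The 3x1 / 1x3 kernel sizes are the usual factorization of a 3x3 kernel and are
    assumed here; `dilation` enlarges the receptive field in the later blocks.
    """

    def __init__(self, channels, dilation=1):
        super().__init__()
        self.conv_v = nn.Conv2d(channels, channels, (3, 1),
                                padding=(dilation, 0), dilation=(dilation, 1))
        self.conv_h = nn.Conv2d(channels, channels, (1, 3),
                                padding=(0, dilation), dilation=(1, dilation))
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv_v(x))
        out = self.bn(self.conv_h(out))
        return self.relu(out + x)        # residual connection
```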

DL encoder: Unlike the RGB encoder, the depth and luminance (DL) encoder processes depth and luminance images concatenated as a two-channel input. Operating in parallel to the RGB encoder, the DL encoder is designed with dense connectivity to enhance information flow from earlier to later layers. Specifically, it consists of a downsampler (identical to that in the RGB encoder) followed by three dense blocks comprising four, three and four modules respectively, with the same number of channels as the RGB encoder. Each dense block is followed by a transition layer consisting of a convolution followed by an average pooling layer. As some datasets do not contain depth maps, we also use a luminance-only (L) encoder, identical to the DL encoder except that it takes only the luminance channel.
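The dense connectivity and transition layers described above can be sketched as follows. This is a simplified DenseNet-style illustration; the growth rate, layer counts and kernel sizes are placeholders rather than the exact configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense connectivity: every layer sees the concatenation of all previous outputs."""

    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1, bias=False),
            ))
            channels += growth_rate
        self.out_channels = channels

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

class Transition(nn.Module):
    """Transition after each dense block: channel reduction then average pooling."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        return self.pool(self.reduce(x))
```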

The RGB and DL encoders are linked by fusing output layers from blocks sharing the same number of channels. The fusion is implemented simply by summing the two layers, such that for encoder outputs $a$ and $b$ the fused feature map is $a + b$, where $b$ comes from either the DL or the L encoder.

Decoder: After fusing the last extracted feature maps from the RGB encoder and either the DL or L encoder, a decoder upsamples the feature maps back to the original resolution. The upsampling is implemented in three stages with {64, 16, 19} output channels. In the first two stages, a transposed convolution, batch normalization and ReLU activation, as well as two non-bottleneck modules, are employed. As the last component of the decoder, a transposed convolutional layer maps the output to the 19 class labels we aim to predict.

Unlike LDFNet [18], we utilize skip connections that pass the fused features from the encoders into the decoder, in order to avoid losing high-level spatial features before they are further downsampled. The fused feature maps with {64, 16} channels passed from the encoders are concatenated with the corresponding upsampled feature maps in the decoder. As the semantic segmentation loss function, a pixel-wise softmax with cross-entropy is used, summing over all pixels within a patch, as follows:

$p_k(x) = \dfrac{\exp(a_k(x))}{\sum_{k'=1}^{K} \exp(a_{k'}(x))}$    (2)

$\mathcal{L}_{seg} = -\sum_{x} \log\big(p_{\ell(x)}(x)\big)$    (3)

where $a_k(x)$ denotes the output of the segmentation network, i.e. the feature activation for channel $k$ at pixel position $x$, $K$ is the number of classes, $p_k(x)$ is the approximated maximum function (softmax) and $\ell(x)$ is the ground truth label. As a multi-task end-to-end pipeline, the joint loss function for the model with two tasks is calculated as follows:

$\mathcal{L} = \mathcal{L}_{seg} + \lambda\,\mathcal{L}_{GAN}$    (4)

where $\lambda$ is a weighting coefficient, empirically chosen to balance the two losses.
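A minimal PyTorch sketch of these objectives is shown below. It assumes the joint loss takes the form of Eq. (4) with the weighting applied to the adversarial term, and uses `F.cross_entropy`, which combines the softmax of Eq. (2) with the negative log-likelihood of Eq. (3) (averaged rather than summed over pixels, which differs only by a constant factor).

```python
import torch.nn.functional as F

def segmentation_loss(logits, labels, ignore_index=255):
    """Pixel-wise softmax + cross-entropy in the spirit of Eqs. (2)-(3).

    logits: (N, K, H, W) raw class scores from the decoder (K = 19 classes here);
    labels: (N, H, W) integer ground-truth class index per pixel.
    """
    return F.cross_entropy(logits, labels, ignore_index=ignore_index, reduction="mean")

def joint_loss(seg_loss, gan_loss, lam):
    """Multi-task objective assumed for Eq. (4): segmentation plus weighted adversarial loss.

    `lam` is the weighting coefficient (its value is chosen empirically in the paper).
    """
    return seg_loss + lam * gan_loss
```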

IV Dataset

The availability of numerous well-annotated datasets such as [5, 11, 30, 8] has led to a proliferation of semantic segmentation studies. In this section, we present the datasets used in this paper: the Cityscapes dataset [5] as the base dataset representing clear scenes, the Foggy Cityscapes dataset [31] as a partially synthetic dataset in which fog is added to the clear scenes, and Foggy Driving [31] and Foggy Zurich [7] as real-world images in which adverse weather (fog) is present. In addition, we describe the illumination-invariant pre-transformation used in our approach as a technique to reduce the impact of varying illumination conditions.

Fig. 4: Sample images from Foggy Zurich [7] and Foggy Driving [31] along with their annotations.

Cityscapes Dataset: We evaluate our approach on Cityscapes [5], a large dataset for semantic segmentation of urban scenes. The dataset comprises finely annotated training and testing images (at a resolution of 2048 x 1024) with 19 pixel classes: {road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle and bicycle}.

Foggy Cityscapes Dataset: Foggy Cityscapes [31] is a partially synthetic dataset generated from the real scenes of Cityscapes [5] by adding synthetic fog to the real images using fog simulation [31] (the real images were taken in clear-weather conditions). Three versions of this dataset, differing in fog density level (controlled via the attenuation coefficient, from light to dense fog), are used. As the real-scene dataset is provided with finely annotated images, we employ these annotations as labels for the synthetic foggy datasets. Foggy Cityscapes follows the same training and testing split and image resolution as Cityscapes.

Dataset            |      Foggy Zurich       |      Foggy Driving      |    Foggy Cityscapes
Method             | Global  Class   mIoU    | Global  Class   mIoU    | Global  Class   mIoU
AdSegNet [36]      |   -       -     0.25    |   -       -     0.44    |   -       -      -
SFSU [31]          |   -       -     0.35    |   -       -     0.46    |   -       -      -
CMAda2+ [32]       |   -       -     0.43    |   -       -     0.49    |   -       -      -
CMAda3+ [7]        |   -       -     0.46    |   -       -     0.49    |   -       -      -
Our approach:
  IAB              |  0.84    0.54   0.43    |  0.91    0.65   0.54    |  0.89    0.69   0.54
  IHS              |  0.89    0.61   0.45    |  0.90    0.62   0.51    |  0.88    0.66   0.53
  FS               |  0.89    0.60   0.51    |  0.88    0.58   0.44    |  0.90    0.69   0.56
  IIT              |  0.91    0.63   0.52    |  0.91    0.67   0.54    |  0.92    0.70   0.59
  GCS              |  0.94    0.60   0.54    |  0.89    0.72   0.59    |  0.92    0.71   0.60
TABLE I: Quantitative comparison of semantic segmentation over the Foggy Zurich [7], Foggy Driving [31] and Foggy Cityscapes [31] datasets (19 classes, reporting global average, class average and mean IoU accuracy). Our approach is evaluated in five variants: with the Illumination-Invariant Transform (IIT) [24] as pre-transform, combined with the AB channels (of LAB) as IAB or with the HS channels (of HSV) as IHS; with no transform on the original Foggy Scenes (FS); and with the Generated Clear Scenes (GCS) produced using [42].

Foggy Driving Dataset: Foggy Driving [31] (Fig. 4) is a real-world dataset collected in foggy weather conditions, consisting of images with annotations for both semantic segmentation and object detection tasks. Following the Cityscapes convention, Foggy Driving is labelled with the same 19 classes.

Foggy Zurich Dataset: Foggy Zurich [7] (Fig. 4) is a real foggy-scene dataset collected in the city of Zurich. Following the Cityscapes annotation approach, Foggy Zurich provides pixel-level annotations for 40 scenes, including dense fog.

Illumination-Invariant Pre-transformation: The illumination-invariant image, in which global illumination variation and localised shadows are significantly reduced, is computed using the approach proposed in [24]. To generate such an invariant representation, a 3-channel floating point RGB image is converted into the corresponding single-channel illumination-invariant image $I$ as follows:

$I = 0.5 + \log(G) - \alpha \log(B) - (1 - \alpha)\log(R)$    (5)

where $\alpha$ is set according to the spectral response of the reference camera in use (Point Grey Bumblebee2), and the R, G, B pixel values are normalised into the range [0, 1]. For further evaluation, we use the illumination-invariant channel combined with the AB channels (of LAB) as IAB, and with the HS channels (of HSV) as IHS, to assess their impact on the performance of our model.

Luminance Transformation: The luminance image is a single-channel grayscale image generated from the RGB input to reduce noise and provide better feature extraction, defined as follows:

$L = 0.299R + 0.587G + 0.114B$    (6)
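Both pre-transformations can be computed per image with a few lines of NumPy, as sketched below; the channel ordering, the camera-dependent value of alpha and the BT.601-style luminance weights are assumptions made for illustration rather than the exact implementation.

```python
import numpy as np

def illumination_invariant(rgb, alpha, eps=1e-6):
    """Illumination-invariant image following the form of Eq. (5) from [24].

    rgb: float image in [0, 1], shape (H, W, 3) ordered as R, G, B;
    alpha: camera-dependent constant (set for the Point Grey Bumblebee2 in the paper).
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.5 + np.log(g + eps) - alpha * np.log(b + eps) - (1.0 - alpha) * np.log(r + eps)

def luminance(rgb):
    """Single-channel luminance image as in Eq. (6) (standard grayscale weights assumed)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b
```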

V Implementation Details

We implement our approach in PyTorch [27]. For optimization, we employ the Adam optimizer [21], and the weighting coefficient $\lambda$ in the loss function is chosen empirically. Following [28] and [18], we weight the classes of the dataset, due to the imbalanced number of pixels per class, as follows:

$w_{class} = \dfrac{1}{\ln(c + p_{class})}$    (7)

where $c$ is an additional parameter set to 1.10 to restrict the class weights and $p_{class}$ is the probability of pixels belonging to that class. We train the model using NVIDIA Titan X and GTX 1080Ti GPUs. Like [18], we apply data augmentation during training using random horizontal flips. For semantic accuracy evaluation, we use the following measures: (1) class average accuracy, the mean of the predictive accuracy over all classes; (2) global accuracy, which measures the overall scene pixel classification accuracy; and (3) mean intersection over union (mIoU).
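The class weighting of Eq. (7) can be estimated from the training annotations as sketched below; function and argument names are illustrative, and `ignore_index` marks unlabelled pixels.

```python
import numpy as np

def class_weights(label_images, num_classes=19, c=1.10, ignore_index=255):
    """Class weighting of Eq. (7): w_class = 1 / ln(c + p_class).

    p_class is the empirical probability of a pixel belonging to each class,
    estimated from the training annotations; c = 1.10 bounds the weights.
    """
    counts = np.zeros(num_classes, dtype=np.float64)
    for labels in label_images:                       # iterable of (H, W) integer label maps
        valid = labels != ignore_index
        counts += np.bincount(labels[valid].ravel(), minlength=num_classes)[:num_classes]
    p_class = counts / counts.sum()
    return 1.0 / np.log(c + p_class)
```

The resulting weights can then be passed to the segmentation loss, e.g. via the `weight` argument of `F.cross_entropy`.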

VI Evaluation

We evaluate the performance of automotive scene understanding and segmentation using the modified LDFNet [18] CNN architecture, on the Cityscapes [5], Foggy Driving [31], and Foggy Zurich [7] datasets. The evaluation was performed as follows:

  1. we train the semantic segmentation model (Fig. 3), employed later as a sub-model of the overall model (Fig. 2) trained in step 3, on Foggy Cityscapes (the partially synthetic dataset).

  2. we fine-tune the model trained in step 1 on the Foggy Zurich and Foggy Driving datasets (real-world datasets).

  3. the foggy datasets used in step 1 are first mapped to clear scenes using the domain adaptation sub-model (Fig. 2), and the generated images are subsequently fed into the semantic segmentation sub-model (Fig. 2) to train the second task (semantic segmentation).

  4. we fine-tune the model obtained from step 3 on the Foggy Zurich and Foggy Driving datasets.

  5. we generate the illumination-invariant transform IIT and the perceptual colour-space variants IAB and IHS from the foggy datasets and use them as alternative inputs for the model trained in step 1, to assess their impact.
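The three evaluation measures reported below (global average, class average and mean IoU; Section V) can be computed from a per-pixel confusion matrix accumulated over the test set. A minimal sketch, with the standard definitions assumed:

```python
import numpy as np

def segmentation_scores(conf_matrix):
    """Global accuracy, class-average accuracy and mean IoU from a confusion matrix.

    conf_matrix[i, j] counts pixels of ground-truth class i predicted as class j.
    """
    tp = np.diag(conf_matrix).astype(np.float64)
    gt_per_class = conf_matrix.sum(axis=1)            # pixels of each ground-truth class
    pred_per_class = conf_matrix.sum(axis=0)          # pixels predicted as each class

    global_acc = tp.sum() / conf_matrix.sum()
    class_acc = np.mean(tp / np.maximum(gt_per_class, 1))
    iou = tp / np.maximum(gt_per_class + pred_per_class - tp, 1)
    return global_acc, class_acc, np.mean(iou)
```

The confusion matrix itself can be accumulated per image with, for example, `np.add.at(conf_matrix, (gt.ravel(), pred.ravel()), 1)` over valid pixels.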

Class           |   FS     IAB    IHS    IIT    GCS
Road            |  95.1   94.6   95.2   95.6   95.0
Sidewalk        |  65.8   71.0   71.7   74.2   66.4
Building        |  78.8   78.4   77.5   80.2   84.4
Fence           |  29.1   31.5   29.8   34.9   24.3
Wall            |  39.6   43.2   35.6   44.5   45.8
Vegetation      |  49.4   47.4   44.5   52.2   51.6
Terrain         |  45.6   47.3   49.6   49.4   50.9
Car             |  53.0   54.2   54.8   57.6   56.3
Truck           |  85.7   85.7   84.7   86.6   87.9
Train           |  59.9   58.3   63.2   61.3   59.9
Bus             |  70.0   64.4   63.8   67.2   84.7
Bicycle         |  58.6   51.2   49.8   66.5   63.6
Motorcycle      |  26.8   34.4   28.4   53.3   45.6
Sky             |  81.4   71.3   72.5   80.5   85.0
Pole            |  51.3   41.4   30.2   60.1   55.8
Traffic-sign    |  36.2   41.1   41.0   48.2   47.7
Traffic-light   |  02.1   22.8   16.3   27.5   33.8
Person          |  42.8   48.0   57.5   50.7   50.2
Rider           |  47.0   45.5   50.4   53.4   49.2
TABLE II: Class IoU results on Foggy Cityscapes [31] using the five evaluation methods. FS: Foggy Scene (no transform); IAB: illumination-invariant combined with AB in (LAB); IHS: illumination-invariant combined with HS in (HSV); GCS: Generated Clear Scene [42]; IIT: Illumination-Invariant Transform using [24].
Fig. 5: A comparison of semantic segmentation predictions on Cityscapes [5] for the proposed approach. The left column shows results with no domain adaptation or image pre-transformation, followed by the four scenarios: GCS: Generated Clear Scene [42]; IIT: Illumination-Invariant Transform [24]; IHS: illumination-invariant combined with HS (HSV colour space); and IAB: illumination-invariant combined with AB (LAB colour space).

We present and analyze the results of each of these steps in the remainder of this section.

Foggy Scenes (FS): As an initial stage, we evaluate the performance of semantic segmentation on the foggy scene datasets: {Foggy Cityscapes [31], Foggy Driving [31], and Foggy Zurich [7]}. Here, we treat the semantic segmentation model (shown in Fig. 3) as an independent model, isolated from the whole pipeline, to investigate its performance on foggy data. With an improvement of 0.05 in mIoU over the best reference result of CMAda [7], our approach achieves the highest mean intersection-over-union (mIoU) accuracy (0.51) on Foggy Zurich [7]. However, the reference results of CMAda [7] produce the highest mIoU (0.49) on Foggy Driving [31] (see Table I). As scene visibility is significantly poor in foggy weather conditions, this method produces the lowest accuracy (when compared with the domain adaptation and illumination-invariant pre-transformation techniques discussed later) in the evaluation measures (overall, class average, and mean IoU accuracy): (0.90, 0.69, 0.56) on Foggy Cityscapes, (0.88, 0.58, 0.44) on Foggy Driving, and (0.89, 0.60, 0.51) on Foggy Zurich (see Table I). Within individual class performance, this method fails to achieve any improvement (Table II) when compared with the methods discussed in the remainder of this section.

Generated Clear Scenes (GCS): The second evaluation is performed on the generated clear-scene datasets obtained using the domain transfer approach of Section III-A. We map the source domain (foggy scenes) of the aforementioned foggy datasets into the target domain (clear scenes), assuming that this increases the level of visibility and leads to better scene understanding and segmentation. In this stage, we make use of the two models (domain transfer and semantic segmentation) simultaneously in one pipeline (Fig. 2). As a result, this method outperforms the reference results of [31, 7] and the above method on all evaluation measures (overall, class average, mIoU): (0.92, 0.71, 0.60) on Foggy Cityscapes, (0.89, 0.72, 0.59) on Foggy Driving, and (0.94, 0.60, 0.54) on Foggy Zurich (see Table I). With improvements of 0.08 and 0.10 in mIoU over the best reference results on Foggy Zurich and Foggy Driving respectively, this method shows the large impact of employing a domain transfer technique to map the complex weather condition to a clear one. For per-class performance, this method achieves the highest class IoU (cIoU) results on Foggy Cityscapes in seven classes among the five methods examined in this work (see Table II), placing it second when compared with the other methods in this respect.

Illumination-Invariant Transform (IIT): As a slight variation on the first evaluation, we train the semantic segmentation model on the illumination-invariant images generated from the foggy datasets (Foggy Cityscapes, Foggy Driving, Foggy Zurich) using the approach of [24]. In other words, we replace the foggy RGB images with illumination-invariant images to assess their impact on the CNN-based model performance. This method achieves the highest mIoU accuracy, 0.52 and 0.54 on Foggy Zurich and Foggy Driving respectively, when compared with the reference results of [31, 7] (see Table I). It achieves the second highest accuracy in (overall, class average, and mIoU accuracy): (0.92, 0.70, 0.59) on Foggy Cityscapes, (0.91, 0.67, 0.54) on Foggy Driving, and (0.91, 0.63, 0.52) on Foggy Zurich when compared with the above methods (see Table I). In the per-class comparison, superior performance in the majority of classes (eleven classes, on Foggy Cityscapes) among the five methods examined in this work (see Table II) is achieved using IIT, ranking it top in this respect.

VII Conclusion

This paper proposes a novel end-to-end approach to automotive semantic scene understanding via domain transfer to an illumination-invariant representation. The proposed model can semantically predict per-pixel scene labels under extreme foggy weather conditions. The use of domain adaptation maps a scene taken in foggy conditions to a target domain considered optimal (clear weather) with increased scene visibility. As a result, the performance of deep convolutional network-based scene understanding and segmentation under adverse weather conditions is improved over prior work [31, 7]. As a further means of improving scene understanding, we use an illumination-invariant pre-transformation technique, with and without hybrid colour information, applied as alternative inputs to the semantic segmentation network. By examining these transforms, we show that pre-processing can influence trained network performance. Using a fusion-based architecture, dense connectivity and skip connections for feature fusion, our approach achieves significant results over the state of the art in semantic segmentation under foggy weather conditions [31, 7, 32].

References

  • [1] N. Alshammari, S. Akcay, and T. P. Breckon (2018) On the impact of illumination-invariant image pre-transformation for contemporary automotive semantic scene understanding. In IEEE Intelligent Vehicles Symposium (IV), pp. 1027–1032. Cited by: §II-A, §II-C.
  • [2] J. Álvarez and A. Lopez (2011) Road detection based on illuminant invariance. IEEE Transactions on Intelligent Transportation Systems 12 (1), pp. 184–193. Cited by: §I, §I, §II-C.
  • [3] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) SegNet: a deep convolutional encoder-decoder architecture for scene segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence. Cited by: §II-A.
  • [4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: §I, §II-A.
  • [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §II-A, §IV, §IV, §IV, Fig. 5, §VI.
  • [6] P. Corke, R. Paul, W. Churchill, and P. Newman (2013) Dealing with shadows: capturing intrinsic scene appearance for image-based outdoor localisation. In the IEEE Int. Conf. on Intelligent Robots and Systems, pp. 2085–2092. Cited by: §I, §II-C.
  • [7] D. Dai, C. Sakaridis, S. Hecker, and L. Van Gool (2019) Curriculum model adaptation with synthetic and real data for semantic foggy scene understanding. arXiv preprint arXiv:1901.01415. Cited by: §I, §I, §II-A, §II-A, Fig. 4, TABLE I, §IV, §IV, §VI, §VI, §VI, §VI, §VII.
  • [8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. Int. Journal of Computer Vision (IJCV) 88 (2), pp. 303–338. Cited by: §IV.
  • [9] G. Finlayson, M. Drew, and C. Lu (2009) Entropy minimization for shadow removal. International Journal of Computer Vision 85 (1), pp. 35–57. Cited by: §I, §II-C.
  • [10] L. A. Gatys, A. S. Ecker, and M. Bethge (2015) A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576. Cited by: §II-B.
  • [11] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 3354–3361. Cited by: §IV.
  • [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Cited by: §II-B.
  • [13] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers (2016-11) FuseNet: incorporating depth into semantic segmentation via fusion-based cnn architecture. In Asian Conference on Computer Vision, Cited by: §I, §II-A.
  • [14] K. He, J. Sun, and X. Tang (2010) Single image haze removal using dark channel prior. IEEE Trans. on Pattern Analysis and Machine Intelligence 33 (12), pp. 2341–2353. Cited by: §I.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016-06) Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A.
  • [16] C. Holder and T. Breckon Encoding stereoscopic depth features for scene understanding in off-road environments. In Image Analysis and Recognition, pp. 427–434. External Links: ISBN 978-3-319-93000-8 Cited by: §II-A.
  • [17] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017-07) Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A.
  • [18] S. Hung, S. Lo, and H. Hang (2019) Incorporating luminance, depth and color information by a fusion-based network for semantic segmentation. In The IEEE Int. Conf. on Image Processing (ICIP), pp. 2374–2378. Cited by: Fig. 2, §I, §II-A, §III-B, §III-B, §V, §V, §VI.
  • [19] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2016) Image-to-image translation with conditional adversarial networks. arXiv. Cited by: §I, §II-B.
  • [20] T. Kim, Y. Tai, and S. Yoon (2017) PCA based computation of illumination-invariant space for road detection. In IEEE Winter Conf. on Applications of Computer Vision (WACV), pp. 632–640. Cited by: §II-C.
  • [21] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In Proc. Int. Conf. Learning Representations. Cited by: §V.
  • [22] Y. Li, N. Wang, J. Liu, and X. Hou (2017) Demystifying neural style transfer. arXiv preprint arXiv:1701.01036. Cited by: §II-B.
  • [23] G. Lin, A. Milan, C. Shen, and I. Reid (2017-07) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A, §II-A.
  • [24] W. Maddern, A. Stewart, C. McManus, B. Upcroft, W. Churchill, and P. Newman (2014) Illumination invariant imaging: applications in robust vision-based localisation, mapping and classification for autonomous vehicles. In IEEE Int. Conf. on Robotics and Automation (ICRA), Vol. 2, pp. 3. Cited by: Fig. 1, §I, §I, §II-C, TABLE I, §IV, Fig. 5, TABLE II, §VI.
  • [25] K. Nishino, L. Kratz, and S. Lombardi (2012) Bayesian defogging. Int. Journal of Computer Vision 98 (3), pp. 263–278. Cited by: §I.
  • [26] A. E. Orhan and X. Pitkow (2017) Skip connections eliminate singularities. arXiv preprint arXiv:1701.09175. Cited by: §I.
  • [27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §V.
  • [28] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) Enet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147. Cited by: §V.
  • [29] O. Ronneberger, P. Fischer, and T. Brox U-net: convolutional networks for biomedical image segmentation. In Proc. Int. Conf. Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §I, §II-A.
  • [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. Int. Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §II-A, §IV.
  • [31] C. Sakaridis, D. Dai, and L. Van Gool (2018) Semantic foggy scene understanding with synthetic data. Int. Journal of Computer Vision (IJCV) 126 (9), pp. 973–992. Cited by: Fig. 1, §I, §I, §II-A, §II-A, Fig. 4, TABLE I, §IV, §IV, §IV, TABLE II, §VI, §VI, §VI, §VI, §VII.
  • [32] C. Sakaridis, D. Dai, S. Hecker, and L. Van Gool (2018) Model adaptation with synthetic and real data for semantic dense foggy scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 687–704. Cited by: TABLE I, §VII.
  • [33] P. Sangkloy, J. Lu, C. Fang, F. Yu, and J. Hays (2017-07) Scribbler: controlling deep image synthesis with sketch and color. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-B.
  • [34] J. Santos (2015) Visual road following using intrinsic images. In European Conference on Mobile Robots, pp. 1–6. Cited by: §I, §I, §II-C.
  • [35] T. Tong, G. Li, X. Liu, and Q. Gao (2017) Image super-resolution using dense skip connections. In Proceedings of the IEEE Int. Conf. on Computer Vision, pp. 4799–4807. Cited by: §I, §I.
  • [36] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7472–7481. Cited by: TABLE I.
  • [37] B. Upcroft, C. McManus, W. Churchill, W. Maddern, and P. Newman (2014) Lighting invariant urban street classification. In IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 1712–1718. Cited by: §I, §I, §II-C.
  • [38] Y. Wang and C. Fan (2014) Single image defogging by multiscale depth fusion. IEEE Trans. on Image Processing 23 (11), pp. 4826–4837. Cited by: §I.
  • [39] J. Yamanaka, S. Kuwashima, and T. Kurita (2017) Fast and accurate image super resolution by deep cnn with skip connection and network in network. In Int. Conf. on Neural Information Processing, pp. 217–225. Cited by: §I.
  • [40] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §II-A.
  • [41] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017-07) Pyramid scene parsing network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-A.
  • [42] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE Int. Conf on Computer Vision, pp. 2223–2232. Cited by: Fig. 1, Fig. 2, §I, §II-B, §III-A, TABLE I, Fig. 5, TABLE II.