Autonomous mobile agents need a high-level understanding of their environment to plan their trajectories and function effectively. This requires robust perception, which stems from tasks such as object detection and semantic segmentation that recognize and localize safety-critical objects such as pedestrians and vehicles. Most object detection methods are designed for and trained with images captured under ideal environmental conditions, and often do not generalize to adverse settings (such as foggy and low-light conditions). Recent attempts to achieve such robustness include domain-classification-based invariant detection [chen2018domain, Hsu_2020_WACV, Tian_2021_ICCV, wu2021vector, wu2021instance, lin2021domain, gu2021pit, he2020domain], prior-knowledge-based feature adaptation [sindagi2020prior, zhang2020unified], adversarially-trained image alignment [zhang2021domain, zhuang2020ifan], map-specific domain adaptation [sakaridis2020map], and physics-prior-based zero-shot learning [lengyel2021zero, Zheng_2022_WACV].
More recently, learnable image pre-processing methods have emerged as a superior alternative [liu2022imageadaptive, lengyel2021zero, huang2020dsnet, liu2019griddehazenet, guo2020zero, hu2018exposure, Zheng_2022_WACV, Yang_2020_CVPR, LI2021106617, Zhang_2020_ACCV]. However, these methods are either limited to a single preprocessing module [lengyel2021zero, Zheng_2022_WACV, Yang_2020_CVPR, LI2021106617, Zhang_2020_ACCV], require domain-specific architectural variations [liu2019griddehazenet, guo2020zero, liu2022imageadaptive], or use multiple modules in an arbitrary sequential order [hu2018exposure, liu2022imageadaptive]. In this work, we address all these limitations and present a novel approach that learns to enhance images for object detection under adverse conditions in an end-to-end manner. This is achieved through a learnable gating-based weighted combination of concurrent image processing operations, dubbed GDIP (Gated Differentiable Image Processing). Our proposed GDIP method integrated with Yolo significantly outperforms the current SOTA, Image Adaptive Yolo (IA-Yolo [liu2022imageadaptive]), which relies on an arbitrary sequential image preprocessing pipeline. On real-world fog (RTTS [RTTS]) and low-light (ExDark [Exdark]) datasets, GDIP-Yolo leads IA-Yolo in mAP by 5.76 and 15.89 respectively. The key contributions of this paper are listed below:
a novel gating mechanism that enables concurrent relative weighting of multiple differentiable image processing modules to enhance images for object detection under adverse environmental conditions;
a multi-level version of GDIP where an image is progressively enhanced through multiple GDIP blocks, each guided by a different layer of the image encoder; and
an adaptation of GDIP as a training regularizer which directly improves object detection training for adverse conditions and eliminates the need for GDIP during inference, thus saving compute time at the cost of a minor drop in performance.
II Related Work
Object detection is the problem of localizing and classifying objects in a scene and has seen a recent uptick in popularity due to its applications in autonomous vehicles and beyond. There are two primary approaches to the object detection problem: two-stage detection and single-stage detection. Two-stage detectors like FasterRCNN [ren2015faster] and MaskRCNN [he2017mask] utilize a region proposal network (RPN), which generates proposals of plausible regions of interest that are sent to a downstream network for classification. Two-stage approaches are computationally expensive, which limits their range of applications. Single-stage detectors like Yolo [redmon2018yolov3], RetinaNet [lin2017focal], SSD [liu2016ssd] and FCOS [tian2019fcos] bypass the heavy RPN and directly extract objects with their associated labels. Nevertheless, under adverse conditions, both types of networks fail to detect objects.
Typical object detection techniques fail in adverse weather conditions, and transfer learning has proven to be a viable way to adapt them to such conditions. Chen et al. [chen2018domain] approach this problem from a domain adaptation perspective and utilize image and instance-level features to reduce the domain shift. Sindagi et al. [sindagi2020prior] use weather-specific knowledge and define a prior-adversarial loss with feature recovery to mitigate weather effects on detection. Multiscale Domain Adaptive Yolo (MS-DAYOLO) [hnewa2021multiscale] uses classifiers for each domain at different scales to learn domain-invariant features. Zhang et al. [zhang2021domain] employ image-level feature alignment to match local and global features.
Differentiable Image Pre-processing: Another popular approach to the problem is to perform image enhancement before object detection. In Exposure [hu2018exposure], a deep Reinforcement Learning model learns a policy to apply a sequence of enhancement operations. AOD-Net [li2017aod] dehazes images using a CNN designed on a re-formulated atmospheric scattering model. Dong et al. [dong2020multi] use an encoder-decoder architecture (U-Net) with the strengthen-operate-subtract boosting strategy to help dehaze images. GridDehazeNet [liu2019griddehazenet] employs a multi-scale attention mechanism with pre- and post-processing modules to generate better inputs and reduce artifacts in the final dehazed image. He et al. [he2010single] use a dark channel prior (in most haze-free patches, at least one color channel has very low intensity) to dehaze images but do not perform object detection. Some models target only a single adverse condition, such as Guo et al. [guo2020zero] for low-light enhancement. Zeng et al. [zeng2020learning] learn multiple look-up tables and use CNN predictions to fuse them into a single table that transforms the color and tone of a source image. DSNet [huang2020dsnet] uses two subnets for image restoration and object detection to boost performance in adverse weather conditions. IA-Yolo [liu2022imageadaptive] uses a CNN to predict differentiable image processing parameters and is trained in conjunction with Yolov3 [redmon2018yolov3], so that image enhancement and object detection are learned in an end-to-end fashion.
Unlike existing methods, GDIP is a domain agnostic network architecture that handles multiple image processing operations concurrently. Additionally, it has a unique advantage with its utility as a training regularizer, which eliminates image enhancement overhead during inference resulting in higher throughput.
III Proposed Method
We propose a Gated Differentiable Image Processing (GDIP) framework that learns to enhance input images for object detection in adverse environmental conditions. The GDIP block learns parameters for multiple Image Processing (IP) operations performed concurrently and learns the optimal weights to combine their outputs. We use the following IP operations (similar to IA-Yolo [liu2022imageadaptive]): tone correction, contrast balance, sharpening, defogging (DF), gamma correction (G), white balancing, and the identity operation. Unlike IA-Yolo’s sequential image enhancement, GDIP enhances images through a weighted combination of concurrent IP operations.
III-A Gated Differentiable Image Processing (GDIP) block:
The GDIP block (shown in Fig. 2) consists of multiple gated image processing modules that individually enhance images, which are then combined through the weights predicted by the gates. Each module contains a linear layer, a differentiable image processing operation, a gate (shifted tanh function that returns a value between 0 and 1), and a normalization operation. The linear layer (purple linear block in Fig. 2) computes two entities: the parameters required by the differentiable IP block and a scalar value that serves as an input to its corresponding gate. The individual linear layers of every module are passed a common feature embedding as input, obtained from a shared vision encoder (described later). The output of the IP operation (using the predicted parameters) gets multiplied by the scalar output of the gate. The weighted outputs of individual blocks are finally aggregated to obtain an enhanced image. Expressed mathematically, the output of the GDIP block is:

$$z = \sum_{i} w_i \, N\big(f_i(x)\big) \qquad (1)$$

where $x$ is the input image captured under adverse environmental conditions, $z$ is the enhanced clear image, $f_i$ represents the $i$-th IP operation (top-right in Fig. 2) weighted by its respective scalar gate output $w_i$, and $N$ is the min-max normalization operation. Normalization ensures that the pixel intensity range of the output of all the image processing operations is the same. The IP operations are expressed mathematically in Table I; see [liu2022imageadaptive] for a detailed description.
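To make the gated combination concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It treats an image as a flat list of intensities in [0, 1], fixes the IP parameters and gate inputs by hand instead of predicting them with linear layers from the encoder embedding, and assumes the shifted tanh has the form 0.5 * (tanh(s) + 1). The names `gdip_block`, `gate`, `gamma_op`, and `identity_op` are hypothetical.

```python
import math


def min_max_normalize(pixels):
    """Rescale intensities to [0, 1]; a constant image maps to all zeros."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]


def gate(s):
    """Assumed shifted tanh: maps a real scalar to a weight in (0, 1)."""
    return 0.5 * (math.tanh(s) + 1.0)


def gamma_op(pixels, gamma):
    """Gamma correction on intensities in [0, 1]."""
    return [p ** gamma for p in pixels]


def identity_op(pixels, _unused):
    """Identity operation: returns a copy of the input."""
    return list(pixels)


def gdip_block(pixels, modules):
    """Sum gate-weighted, min-max normalized outputs of concurrent IP ops.

    `modules` is a list of (operation, parameter, gate_input) tuples; in the
    paper the parameter and gate input come from a per-module linear layer.
    """
    out = [0.0] * len(pixels)
    for op, param, gate_input in modules:
        weight = gate(gate_input)
        enhanced = min_max_normalize(op(pixels, param))
        out = [o + weight * e for o, e in zip(out, enhanced)]
    return out


image = [0.1, 0.2, 0.4, 0.8]
modules = [(gamma_op, 0.5, 2.0), (identity_op, None, -2.0)]
enhanced_image = gdip_block(image, modules)
```

With a strongly positive gate input for gamma correction and a strongly negative one for the identity, the result is dominated by the gamma-corrected image, which mirrors how the gates express relative weighting.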
Our proposed GDIP block requires latent embeddings to compute image processing parameters and gate values. For this purpose, we employ a vision encoder comprising five convolutional layers (each with a kernel size of three and a stride of one). The number of channels in each layer is double that of the previous layer, starting from 64 in the first layer and reaching 1024 in the final layer. Each convolution operation is followed by average pooling (with kernel size three and stride two), while the last layer is followed by global average pooling, the output of which is a 1x1x1024 tensor. This is then projected to a 256-dimensional latent space using a fully connected layer. The GDIP block takes this 256-dimensional embedding from the vision encoder along with the adverse input image and performs image enhancement after computing the necessary parameters.
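The encoder's channel and spatial progression can be checked with a short shape calculation. This is only a sketch under assumptions not stated in the text: convolutions are taken to preserve spatial size (padding of one), average pooling is taken to use no padding, global average pooling is taken to replace the pooling step after the last layer, and the input size of 448 is a hypothetical value. The function name `encoder_shapes` is also hypothetical.

```python
def encoder_shapes(input_size, num_layers=5, base_channels=64):
    """Return (channels, spatial_size) after each encoder stage."""
    shapes = []
    size = input_size
    for layer in range(num_layers):
        channels = base_channels * 2 ** layer  # 64, 128, 256, 512, 1024
        # 3x3 convolution with stride 1: spatial size assumed unchanged.
        if layer < num_layers - 1:
            # Average pooling, kernel 3, stride 2, no padding (assumed).
            size = (size - 3) // 2 + 1
        else:
            # Global average pooling collapses the spatial dimensions.
            size = 1
        shapes.append((channels, size))
    return shapes


shapes = encoder_shapes(448)  # hypothetical input resolution
embedding_dim = 256           # size of the final fully connected projection
```

Under these assumptions the last stage yields 1024 channels at 1x1, matching the 1x1x1024 tensor that is projected to the 256-dimensional embedding.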
GDIP-Yolo: To integrate GDIP with Yolo, we use the vision encoder with GDIP to perform image enhancements (depicted in Fig. 2), and use the enhanced image as input to Yolo. Integrating GDIP with Yolo in this fashion ensures that our architecture does not require any additional loss formulation and uses Yolo’s standard object detection loss [redmon2016you] (referred to as $\mathcal{L}_{obj}$) to train the network for object detection end-to-end.
III-B Multi-Level GDIP (MGDIP):
GDIP-Yolo contains a single GDIP block, which is fed with latent embeddings obtained from the vision encoder. Since we only use the last layer of the vision encoder for this purpose, the information available to GDIP for learning the parameters of the image processing modules is limited. Thus, we propose multi-level progressive image enhancement, achieved by integrating a GDIP block with every layer of the vision encoder, dubbed MGDIP-Yolo. Note that the individual image processing modules within a single GDIP block still operate concurrently, with their corresponding gates providing relative weightings. As shown in Fig. 3, MGDIP progressively enhances images by feeding the output from one GDIP block as input to the next, where individual GDIP blocks are guided by the features extracted from different layers of the vision encoder. The final enhanced output from MGDIP is passed to Yolo for object detection. MGDIP-Yolo is trained in an end-to-end manner using the standard object detection loss, similar to GDIP-Yolo.
We hypothesize that utilizing embeddings from different layers provides GDIP access to multiple feature scales, each of which can have a varied relevance for different image processing operations. This is based on the understanding that earlier layers in CNNs capture lower level information (local information like edges) and later layers capture high-level (global) information. Thus, MGDIP gains the ability to use the local/global feature properties to selectively apply image processing operations.
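The progressive enhancement can be written compactly as a recurrence. The notation below is introduced only for illustration and is not taken from the original text: the adverse input image is written $x$, $e_l$ denotes the embedding taken from the $l$-th layer of the vision encoder, $\mathrm{GDIP}_l$ is the block attached to that layer, $L$ is the number of encoder layers, and $\hat{z}$ is the image passed to Yolo. The sketch assumes the embeddings are computed once from the input image.

```latex
z_0 = x, \qquad
z_l = \mathrm{GDIP}_l\left(z_{l-1},\, e_l\right) \quad \text{for } l = 1, \dots, L, \qquad
\hat{z} = z_L
```

Each $\mathrm{GDIP}_l$ still combines its IP operations concurrently; only the blocks themselves are chained.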
III-C GDIP block as a regularizer:
In this section, we demonstrate how GDIP can also be employed as a feature regularization technique to improve Yolo’s performance while maintaining its throughput.
The original GDIP block used a vision encoder to obtain feature embeddings. Alternatively, multiple GDIP blocks can be connected to intermediate layers of Yolo, bypassing the need for a vision encoder and directly using Yolo’s embeddings to construct an enhanced output, as shown in Fig. 4. Note that this enhanced output is not the input to Yolo but rather a byproduct that we use for training regularization. The reconstruction loss $\mathcal{L}_{rec}$ (Eq. 2) is calculated between this output and the clear version of the input image as a combination of the $L_1$ norm and the Mean Square Error loss. The overall loss function (Eq. 3) is $\mathcal{L} = \mathcal{L}_{obj} + \lambda \, \mathcal{L}_{rec}$, where $\mathcal{L}_{obj}$ is Yolo’s standard object detection loss and $\lambda$ is the weight of the reconstruction loss, which is set empirically.
Inclusion of the reconstruction loss in the formulation helps Yolo learn features that are invariant to adverse conditions, resulting in better performance when compared to standalone Yolo. Since the GDIP blocks exist solely to refine Yolo’s features, they are only required during training and can be removed during inference. This results in an unchanged network architecture (Yolo) that performs better in adverse weather conditions without any loss in throughput.
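A minimal sketch of the regularized training objective follows. It is not the authors' code and makes several assumptions: the norm term is taken to be a mean absolute error, the two reconstruction terms are weighted equally, both are averaged over pixels, and the reconstruction weight of 0.1 is a hypothetical placeholder because the value used in the paper is not available here. Images are flat lists of intensities and the function names are illustrative.

```python
def reconstruction_loss(enhanced, clear):
    """Mean absolute error plus mean squared error between two images."""
    n = len(enhanced)
    l1 = sum(abs(e - c) for e, c in zip(enhanced, clear)) / n
    mse = sum((e - c) ** 2 for e, c in zip(enhanced, clear)) / n
    return l1 + mse


def total_loss(detection_loss, enhanced, clear, rec_weight):
    """Detection loss plus the weighted reconstruction regularizer."""
    return detection_loss + rec_weight * reconstruction_loss(enhanced, clear)


enhanced = [0.5, 0.5]   # byproduct image built from Yolo's features
clear = [0.0, 1.0]      # clear version of the adverse input
rec = reconstruction_loss(enhanced, clear)
loss = total_loss(2.0, enhanced, clear, rec_weight=0.1)  # hypothetical weight
```

At inference time only the detector is evaluated, so neither function is needed once training is finished.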
IV Experimental Setup
Foggy Conditions: We use the RTTS dataset [RTTS], a collection of 4322 natural foggy images with five annotated classes (person, car, bus, bicycle and motorcycle), primarily for testing. The PascalVOC train/val datasets (2007 and 2012) [pascal-voc-2007, pascal-voc-2012] have 22136 clear images that we use as a base to create a synthetic training set. We select images from PascalVOC having objects belonging to the five classes from RTTS and create two datasets, one with clear images (VOCNormal) and one with augmented foggy images generated using the atmospheric scattering model (ASM) [AtmospshereScatteringModel]. We employ the ASM to generate 10 different levels of fog to include variance in our synthetic training set. We subsample and prepare a synthetic testing set in a similar fashion from the PascalVOC 2007 test set with 4952 images (referred to as V_F_Ts). During training, we employ a hybrid strategy where we use a mix of foggy and clear images (in a 2:1 ratio) to help our model learn fog-invariant features.
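The atmospheric scattering model is commonly written as I = J * t + A * (1 - t), with transmission t = exp(-beta * d), where J is the clear intensity, A the airlight, beta the scattering coefficient, and d the scene depth. The sketch below applies it to a flat list of intensities. The depth values, the airlight of 0.5, and the ten beta values are hypothetical choices for illustration; the text does not state how the ten fog levels were parameterized.

```python
import math


def add_fog(pixels, depths, beta, airlight=0.5):
    """Apply the atmospheric scattering model pixel by pixel."""
    foggy = []
    for clear_value, depth in zip(pixels, depths):
        transmission = math.exp(-beta * depth)
        foggy.append(clear_value * transmission + airlight * (1.0 - transmission))
    return foggy


def fog_levels(num_levels=10, beta_min=0.05, beta_step=0.01):
    """Hypothetical scattering coefficients, one per fog level."""
    return [beta_min + beta_step * level for level in range(num_levels)]


clear_pixels = [0.2, 0.9]
depths = [5.0, 20.0]  # assumed per-pixel depth; PascalVOC has no depth maps
foggy_versions = [add_fog(clear_pixels, depths, beta) for beta in fog_levels()]
```

Larger beta or depth pushes every pixel toward the airlight value, which is what washes out distant objects in fog.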
Low-lighting Conditions: The ExDark dataset [Exdark] is a collection of 7363 real-world images with 10 object classes in low lighting conditions that we use to evaluate our models. Similar to preparing the foggy dataset, we select images from PascalVOC having objects from the 10 classes of ExDark and apply a gamma filter to emulate a low-lighting condition. Mathematically, $I_{dark} = I_{clear}^{\gamma}$, where $\gamma$ is sampled uniformly from the range of 1.5 to 5, $I_{clear}$ is the normalized clear image, and $I_{dark}$ is the synthetic dark image. Using the same selection and image processing methods, we generate a synthetic low-light test set (V_D_Ts) from the PascalVOC test set. During training, we employ a hybrid strategy (similar to the foggy setting) by using a mix of dark and clear images.
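The gamma-based darkening is simple enough to sketch directly. The snippet assumes intensities normalized to [0, 1] and stored as a flat list; the function names and the fixed random seed are illustrative choices only.

```python
import random


def sample_gamma(rng, low=1.5, high=5.0):
    """Draw a gamma value uniformly from the range given in the text."""
    return rng.uniform(low, high)


def darken(pixels, gamma):
    """Raise normalized intensities to the power gamma (gamma > 1 darkens)."""
    return [p ** gamma for p in pixels]


rng = random.Random(0)  # seeded only to make the sketch reproducible
gamma = sample_gamma(rng)
clear_pixels = [0.0, 0.25, 0.5, 1.0]
dark_pixels = darken(clear_pixels, gamma)
```

Because every intensity lies in [0, 1], any exponent above one lowers mid-tones while leaving pure black and pure white unchanged.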
IV-B Training Setup
Training for both the foggy and low-lighting settings is done by resizing images to a fixed resolution and using a batch size of 6 for 80 epochs. We use a cosine learning rate scheduler and an SGD optimizer with weight decay.
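For reference, a cosine schedule over the 80 training epochs can be sketched as below. The learning-rate bounds of 1e-4 and 1e-6 are hypothetical placeholders, since the actual range is not available in the text, and no warm-up phase is modeled.

```python
import math


def cosine_lr(epoch, total_epochs, lr_max, lr_min):
    """Cosine decay from lr_max at epoch 0 to lr_min at the final epoch."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


total_epochs = 80
schedule = [cosine_lr(e, total_epochs, lr_max=1e-4, lr_min=1e-6)
            for e in range(total_epochs + 1)]
```

The rate falls slowly at first, fastest around the midpoint, and flattens again near the end of training.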
V Results and Analyses
We provide qualitative and quantitative results that establish our proposed method’s superiority and evaluate design choices through ablation studies. We also show that GDIP variants provide flexibility to prioritize speed or accuracy based on application requirements.
V-A Qualitative Analysis
We compare the results of GDIP-Yolo with the current SOTA IA-Yolo on real-world data in the qualitative comparison figure. Unlike IA-Yolo, our method clears fog and improves lighting conditions without changing underlying color distributions. Our method is able to detect far-off objects such as cars and bicycles (see the third column of that figure), which are generally missed in extreme foggy conditions by SOTA IA-Yolo (see the second column). The last column (low-light conditions) clearly indicates that GDIP-Yolo enhances lighting without artifacts, helping the model detect all objects in the scene (the car in the background, for example). We quantify the improvement provided by GDIP in the next section.
V-B Quantitative Analysis
We compare our proposed method with other SOTA works using the standard object detection evaluation metric - mean average precision (mAP). All mAP values are calculated at an IoU (Intersection over Union) of 0.5.
In Table II, we compare our proposed variants of GDIP with other competing methods on the VOCNormal test set (V_N_Ts), the synthetic VOCFoggy test set (V_F_Ts), and the real-world foggy dataset RTTS. The second column in the table shows the training data used by each of the methods, where “Hybrid” implies the use of both clear and foggy data. We set YoloV3 as the baseline, which is trained on a mix of foggy and clear images to check whether data augmentation alone can improve performance. We also compare our results against a diverse range of methods based on domain adaptation (DA-Yolo [hnewa2021multiscale]), multi-task learning (DSNet [huang2020dsnet]), defogging as pre-processing (MSBDN [dong2020multi], GridDehaze [liu2019griddehazenet]), and adaptive image enhancement (IA-Yolo [liu2022imageadaptive]). It can be observed in Table II that our proposed variants of GDIP establish a new SOTA across different fog datasets.
All GDIP variants perform significantly better than SOTA methods on RTTS, which tests the generalizability of our method to real-world conditions. Our basic GDIP-Yolo variant outperforms the SOTA method IA-Yolo by 5.34 mAP. This can be attributed to the concurrently weighted IP operations unlike IA-Yolo’s fixed sequential pipeline. MGDIP-Yolo further improves upon the GDIP-Yolo by 0.42 mAP and does so consistently across all datasets. It emerges superior to all other methods and GDIP variants, as it benefits from multi-scale information. Our regularizer variant outperforms all SOTA methods, while being as fast as vanilla Yolo (see Subsection V-E), emerging as an alternative to GDIP-Yolo and MGDIP-Yolo with an accuracy-speed trade-off.
We compare our proposed variants with other SOTA methods on the real-world ExDark dataset, synthetic low-lighting VOCDark test set (V_D_Ts), and VOCNormal test set (V_N_Ts), as shown in Table III. Here once again, we set YoloV3 as the baseline trained on hybrid data of a mix of dark and clear images. In addition, we also compare against a diverse range of methods based on light enhancement as pre-processing (ZeroDCE [guo2020zero]), domain adaptation (DA-Yolo [hnewa2021multiscale]), multi-task learning (DSNet [huang2020dsnet]) and adaptive image enhancement (IA-Yolo [liu2022imageadaptive]).
Our proposed GDIP-Yolo outperforms all the existing methods on the real-world ExDark dataset and achieves an absolute increase of 16 mAP over the previous SOTA IA-Yolo. Additionally, the MGDIP-Yolo and GDIP-as-a-regularizer variants also outperform other methods in this setting. In the synthetic low-lighting setting, MGDIP-Yolo emerges superior, while the other GDIP variants also significantly improve performance over other methods. On the VOCNormal test set, our proposed method performs comparably to DSNet [huang2020dsnet]. To conclude, our proposed variants are significantly better than existing approaches for low-light settings (synthetic and real-world).
V-C Detection Statistics
We present the numbers of True Positives (TP), False Positives (FP), and False Negatives (FN) as an interpretable statistical measure, since mAP does not convey the actual detection counts. The TP (Fig. 5 left) and FN (Fig. 5 middle) plots show substantial improvement at high detection confidence thresholds on both the RTTS and ExDark datasets for GDIP-Yolo compared with the SOTA IA-Yolo. For autonomous driving applications, the TP and FN statistics are critical, as failing to detect an object that is present can be catastrophic, and on these statistics GDIP performs significantly better. At high confidence thresholds, GDIP-Yolo yields FP counts comparable to the SOTA while showing much better TP and FN counts.
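The counts discussed here can be obtained with a simple greedy matching procedure, sketched below under stated assumptions. It handles a single object class, represents boxes as (x1, y1, x2, y2) tuples, matches each detection to the unmatched ground-truth box with the highest IoU, and uses the IoU threshold of 0.5 mentioned for the mAP evaluation. The paper's exact matching protocol is not described, so this is an illustration rather than a reproduction, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_detections(detections, ground_truths, conf_threshold, iou_threshold=0.5):
    """Return (TP, FP, FN) for detections given as (confidence, box) pairs."""
    kept = sorted((d for d in detections if d[0] >= conf_threshold),
                  key=lambda d: d[0], reverse=True)
    matched = set()
    tp = fp = 0
    for _, box in kept:
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truths):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)
    return tp, fp, fn


ground_truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0.9, (0, 0, 10, 10)), (0.6, (50, 50, 60, 60)), (0.3, (20, 20, 30, 30))]
counts_at_half = count_detections(detections, ground_truths, conf_threshold=0.5)
```

Raising the confidence threshold discards low-confidence boxes, which typically lowers FP but raises FN; this is why the comparison at high thresholds is informative.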
V-D Ablation Studies
In this section, we demonstrate experimental support for using the gating mechanism and normalization layer in our proposed GDIP block. We validate that incorporating gating and normalization in the GDIP block provides a stable improvement in the downstream detection task for both foggy and low-light conditions (as shown in Table IV).
Single Best vs Weighted Combination: Our proposed gating mechanism helps combine image processing operations through relative weighting (see Eq. 1). To illustrate its effect, we remove this mechanism and use a single best image processing operation based on the highest gate value (referred to as GDIP-max in Table IV), expressed as $z = N(f_k(x))$ where $k = \arg\max_i w_i$, with $f_i$ the $i$-th image processing operation, $w_i$ its gate output, and $N$ the min-max normalization. Without the proposed gating, performance reduces by 10.4 mAP for the RTTS dataset (comparing rows 1 and 2 in Table IV). For the ExDark dataset the performance increases by a negligible 0.15 mAP, which is insubstantial compared to the drop observed in the foggy setting. This study indicates that incorporating enhancements from multiple image processing operations is necessary, as no single operation is sufficient for dealing with adverse conditions.
Uniform vs Predicted Weighting: In this experiment, we compare our proposed relative weighting of image processing operations with a uniform weighting across all operations (referred to as GDIP w/o gates in Table IV), expressed mathematically as $z = \sum_i N(f_i(x))$, where every image processing operation $f_i$ receives the same weight after min-max normalization $N$. We observe that mAP reduces by 0.77 and 0.27 for RTTS and ExDark, respectively. This performance drop can be attributed to the fact that, depending on the environmental conditions of the images, image processing operations need to be weighted differently, which is achieved through gating. This is evident from Fig. 6, where activation of the gamma gate (G) is prominent for low-light conditions, and in the case of foggy conditions, most of the gates remain activated with defogging (DF) being the highest. The gate activation patterns also help improve the interpretability of GDIP by indicating which IP operations are applied, and in what proportions, for a given input image.
With and Without Normalization: We also verify the necessity of the normalization layer after each image processing operation by removing it (GDIP-unnormalized in Table IV), expressed as $z = \sum_i w_i \, f_i(x)$, where $w_i$ is the gate output of the $i$-th image processing operation $f_i$. This leads to a considerable performance drop of around 1.8 and 2.35 mAP on RTTS and ExDark, respectively.
Overall, these ablation studies indicate that GDIP with normalization and the gating mechanism leads to the best overall performance irrespective of environmental conditions and is a promising solution for the object detection task.
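The three ablated variants differ from the full block only in how the per-operation outputs are combined, which the following sketch makes explicit. It is an illustration under assumptions: operation outputs and gate weights are supplied directly as lists, the uniform variant gives every operation a weight of one, and the single-best variant returns the normalized output of the top-gated operation without scaling it by its gate. The function and variant names are hypothetical.

```python
def min_max_normalize(pixels):
    """Rescale intensities to [0, 1]; a constant image maps to all zeros."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]


def combine(op_outputs, gate_weights, variant="gated"):
    """Combine per-operation outputs according to an ablation variant."""
    if variant == "max":
        best = max(range(len(gate_weights)), key=lambda i: gate_weights[i])
        return min_max_normalize(op_outputs[best])
    if variant == "uniform":
        weights = [1.0] * len(op_outputs)
    elif variant in ("gated", "unnormalized"):
        weights = list(gate_weights)
    else:
        raise ValueError("unknown variant: " + variant)
    normalize = variant != "unnormalized"
    result = [0.0] * len(op_outputs[0])
    for output, weight in zip(op_outputs, weights):
        term = min_max_normalize(output) if normalize else output
        result = [r + weight * t for r, t in zip(result, term)]
    return result


op_outputs = [[0.0, 2.0, 4.0], [1.0, 1.5, 2.0]]  # two operations, three pixels
gate_weights = [0.25, 0.75]
gated = combine(op_outputs, gate_weights, "gated")
```

In the unnormalized variant the first operation's larger intensity range dominates the sum regardless of its gate, which illustrates why a common range matters before weighting.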
|Method|RTTS (mAP)|ExDark (mAP)|
|GDIP w/o gates|41.65|42.29|
V-E Real-time performance
In Table V, we compare the real-time performance of our proposed GDIP variants with other techniques. GDIP performs the fastest as a regularizer at around 68 fps on an NVIDIA GTX 1080Ti, which is the same as YoloV3. Our basic variant, GDIP-Yolo, operates at 7 fps higher than IA-Yolo, while achieving SOTA mAP on real-world fog and night datasets.
|Method|Speed (fps)|
|Yolo V3|68.39 ± 1.5|
|MGDIP-Yolo (top-down)|11.38 ± 0.02|
|MGDIP-Yolo (bottom-up)|11.25 ± 0.03|
|GDIP as regularizer|68.39 ± 1.5|
We presented GDIP and MGDIP as domain-agnostic network architectures for object detection in adverse weather conditions, which can be used with existing object detection networks and trained under different adverse conditions, as we demonstrated for fog and low lighting. We also presented a training regularizer variant of GDIP, which improves the baseline Yolo performance under adverse conditions while maintaining its original throughput. All our GDIP variants result in a new state-of-the-art on challenging real-world datasets both under foggy and low-lighting conditions, while only having trained on synthetic adverse condition data, thus exhibiting significant generalization capability.
In the future, this work can be extended to other adverse condition types (e.g., haze, rain, and snow) along with additional relevant image pre-processing operations, which are easy to integrate given their concurrent processing and relative weighting within GDIP. Dealing with adverse conditions is crucial for the long-term autonomy and safe operation of autonomous vehicles, and this work pushes the boundaries of robust perception, moving a step closer to the ubiquity of autonomous vehicles.