Prior-based Domain Adaptive Object Detection for Adverse Weather Conditions

Vishwanath A. Sindagi, et al. ∙ Johns Hopkins University ∙ November 29, 2019

Adverse weather conditions such as rain and haze corrupt the quality of captured images, causing detection networks trained on clean images to perform poorly on such images. To address this issue, we propose an unsupervised, prior-based domain adversarial object detection framework for adapting detectors to different weather conditions. We observe that corruptions due to different weather conditions (i) follow the principles of physics and hence can be mathematically modeled, and (ii) often cause degradations in the feature space, leading to deterioration in detection performance. Motivated by these observations, we propose to use weather-specific prior knowledge obtained from the principles of image formation to define a novel prior-adversarial loss. The prior-adversarial loss used to train the adaptation process aims to produce weather-invariant features by reducing the weather-specific information in the features, thereby mitigating the effects of weather on detection performance. Additionally, we introduce a set of residual feature recovery blocks in the object detection pipeline to de-distort the feature space, resulting in further improvements. The proposed framework outperforms all existing methods by a large margin when evaluated on different datasets such as Foggy-Cityscapes, Rainy-Cityscapes, RTTS and UFDD.


1 Introduction

Object detection [viola2001rapid, felzenszwalb2010object, girshick2014rich, girshick2015fast, liu2016ssd, ren2015faster] is an extensively researched topic in the literature. Despite the success of deep learning based detectors on several benchmark datasets [everingham2010pascal, deng2009imagenet, geiger2013vision, lin2014microsoft], they have limited abilities in generalizing to several practical conditions such as adverse weather.

Figure 1: (a) Weather conditions such as rain and haze can be mathematically modeled as a function of the clean image and a weather-specific prior. We use this weather-specific prior to define a novel prior-adversarial loss for adapting detectors to adverse weather. (b) Existing domain adaptation approaches use a constant target-domain label for the entire image, irrespective of the amount of degradation. Our method uses spatially varying priors that are directly correlated with the amount of degradation.

Recently, several real-world computer vision applications such as autonomous navigation/self-driving cars [qi2018frustum, ku2018joint, liang2018deep, xu2018multi], drone-based surveillance [perera2018uav, zhu2018vision] and video surveillance/forensics [collins2000system, brutzer2011evaluation] have received tremendous interest. Object detectors form a vital backbone in these applications and hence, it is imperative that the detectors work reliably even in the presence of adverse weather conditions. As compared to the general object detection problem, the task of adapting the detectors to adverse weather conditions is relatively less explored.

One approach to solve this issue is to undo the effects of weather by pre-processing the images using existing methods such as image dehazing [fattal2008single, he2011single, zhang2018densely] and/or deraining [Authors16, Authors17e, Authors18]. However, these approaches usually involve complicated networks that need to be trained separately with pixel-level supervision, and the additional pre-processing increases the computational overhead at inference. Furthermore, these methods often involve post-processing such as gamma correction [Sakaridis2018SemanticFS], which still leaves a residual domain shift, preventing such approaches from achieving optimal performance. Another approach would be to re-train the detectors on datasets that include these adverse conditions. However, creating such datasets comes with a high annotation/labeling cost.

Recently, a few methods [Chen2018DomainAF, Shan2018, Saito2018StrongWeakDA] have attempted to overcome this problem by viewing object detection in adverse weather conditions as an unsupervised domain adaptation task. These approaches consider that the images captured under adverse conditions (target images) suffer from a distribution shift [Chen2018DomainAF, gopalan2011domain] as compared to the images on which the detectors are trained (source images). It is assumed that the source images are fully annotated while the target images (with weather-based degradations) are not annotated. They propose different techniques to align the target features with the source features, while training on the source images. These methods are inherently limited in their approach since they employ only the principles of domain adaptation and neglect additional information that is readily available in the case of weather-based degradations.

We consider the following observations about weather-based degradations, which have been ignored in earlier work. (i) Images captured under weather conditions such as haze and rain can be mathematically modeled (see Fig. 1(a), Eq. 8 and Eq. 9). For example, a hazy image is modeled as a superposition of a clean image (attenuated by the transmission map) and atmospheric light [fattal2008single, he2011single]. Similarly, a rainy image is modeled as a superposition of a clean image and rain residue [Authors16, Authors18, Authors17e] (see Fig. 1(a)). In other words, a weather-affected image contains weather-specific information (which we refer to as a prior) - the transmission map in the case of hazy images and the rain residue in the case of rainy images. This weather-specific information/prior causes degradations in the feature space, resulting in poor detection performance. Hence, in order to reduce the degradations in the features, it is crucial to make the features weather-invariant by eliminating the weather-specific priors from the features. (ii) Furthermore, weather-based degradations are spatially varying and hence do not affect the features equally at all spatial locations. Since existing domain-adaptive detection approaches [Chen2018DomainAF, Shan2018, Saito2018StrongWeakDA] label all locations of a target image entirely as target, they assume that the entire image has undergone constant degradation and that all spatial locations are equally affected (see Fig. 1(b)). This leads to incorrect alignment, especially in the regions of images where the degradations are minimal.

Motivated by these observations, we define a novel prior-adversarial loss that uses additional knowledge about the target domain (weather-affected images) for aligning the source and target features. Specifically, the proposed loss is used to train a prior estimation network to predict the weather-specific prior from the features in the main branch, while simultaneously minimizing the weather-specific information present in those features. This results in weather-invariant features in the main branch, thereby mitigating the effects of weather on the detection performance. Additionally, the proposed use of prior information in the loss function results in a spatially varying loss that is directly correlated with the amount of degradation (as shown in Fig. 1(b)). Hence, the use of the prior avoids incorrect alignment.

Finally, considering that weather-based degradations cause distortions in the feature space, we introduce a set of residual feature recovery blocks in the object detection pipeline to de-distort the features. These blocks, inspired by the residual transfer framework proposed in [long2016unsupervised], result in further improvements.

We perform extensive evaluations on different datasets such as Foggy-Cityscapes [Sakaridis2018SemanticFS], RTTS [Li2018BenchmarkingSD] and UFDD [nada2018pushing]. Additionally, we create a Rainy-Cityscapes dataset for evaluating the performance of different detection methods under rainy conditions. Various experiments demonstrate that the proposed method outperforms the existing methods on all the datasets.

2 Related Work

Object detection: Object detection is one of the most researched topics in computer vision. Typical solutions for this problem have evolved from approaches involving sliding-window based classification [viola2001rapid, dalal2005histograms] to the latest anchor-based convolutional neural network approaches [ren2015faster, redmon2016you, liu2016ssd]. Ren et al. [ren2015faster] pioneered the popular two-stage Faster-RCNN approach. Several works have proposed single-stage frameworks such as SSD [liu2016ssd] and YOLO [redmon2016you] that directly predict the object labels and bounding box coordinates. Following previous work [Chen2018DomainAF, Shan2018, Saito2018StrongWeakDA, kim2019diversify, khodabandeh2019robust], we use Faster-RCNN as our base model.
Unsupervised Domain Adaptation: Unsupervised domain adaptation is defined as aligning domains having distinct distributions, namely source and target. It is assumed that images in the source dataset are available with annotations, while no annotation information is provided for the target images. Some of the recently proposed methods for unsupervised domain adaptation include feature distribution alignment [tzeng2017adversarial, ganin2014unsupervised, shu2018dirt, saito2018maximum], residual transfer [long2016unsupervised, long2017deep], and image-to-image translation approaches [hu2018duplex, murez2018image, hoffman2017cycada, sankaranarayanan2018generate]. In feature distribution alignment, an adversarial objective is utilized to learn domain-invariant features. Typically, these methods are implemented using a gradient reversal layer, where the feature generator and the domain classifier play an adversarial game to generate target features that are aligned with the source feature distribution. Most of the research in unsupervised domain adaptation has focused on classification/segmentation problems, and other tasks such as object detection are relatively unexplored.
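For illustration, the sketch below shows a minimal gradient reversal layer in PyTorch; the class and function names are our own and are not tied to any particular codebase.

```python
# Minimal gradient reversal layer (GRL) sketch: identity in the forward pass,
# negated (and optionally scaled) gradient in the backward pass, so that the
# feature generator and the discriminator play an adversarial game with a
# single backward pass.
import torch


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha: float = 1.0):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient flowing back into the feature extractor.
        return -ctx.alpha * grad_output, None


def grad_reverse(x, alpha: float = 1.0):
    return GradReverse.apply(x, alpha)
```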


Domain-adaptive object detection in adverse conditions: Compared to the general detection problem, detection in adverse weather conditions is relatively less explored. Existing methods [Chen2018DomainAF, Shan2018, Saito2018StrongWeakDA, kim2019diversify] have attempted to address this task from a domain adaptation perspective. Chen et al. [Chen2018DomainAF] assumed that adverse weather conditions result in a domain shift, and overcame this by proposing a domain-adaptive Faster-RCNN approach that tackles the shift at two levels: image-level and instance-level. Following a similar argument, Shan et al. [Shan2018] proposed to perform joint adaptation at the image level using the Cycle-GAN framework [zhu2017unpaired] and at the feature level using conventional domain adaptation losses. Saito et al. [Saito2018StrongWeakDA] argued that strong alignment of the features at a global level might hurt detection performance. Hence, they proposed a method which employs strong alignment of the local features and weak alignment of the global features. Kim et al. [kim2019diversify] diversified the labeled data, followed by adversarial learning with the help of multi-domain discriminators. Cai et al. [cai2019exploring] addressed this problem in the semi-supervised setting using the mean teacher framework. Zhu et al. [zhu2019adapting] proposed region mining and region-level alignment in order to correctly align the source and target features. Roychowdhury et al. [roychowdhury2019automatic] adapted detectors to a new domain assuming the availability of a large amount of video data from the target domain. These videos are used to generate pseudo-labels for the target set, which are then employed to train the network. Most recently, Khodabandeh et al. [khodabandeh2019robust] formulated domain adaptation as training with noisy labels; specifically, the model is trained on the target domain using a set of noisy bounding boxes obtained from a detection model trained only on the source domain.

Figure 2: Overview of the proposed adaptation method. We apply the proposed prior-adversarial loss at multiple scales of the network; the loss is supervised by source and target priors of the corresponding sizes. For the source pipeline, additional supervision is provided by the detection loss. For the target pipeline, the feed-forward path through the detection network is modified by the residual feature recovery blocks.

3 Proposed Method

We assume that labeled clean data $\{(x^s_i, y^s_i)\}_{i=1}^{n_s}$ from the source domain ($\mathcal{S}$) and unlabeled weather-affected data $\{x^t_j\}_{j=1}^{n_t}$ from the target domain ($\mathcal{T}$) are available. Here, $y^s_i$ refers to all bounding box annotations and respective category labels for the corresponding clean image $x^s_i$, $x^t_j$ refers to a weather-affected image, $n_s$ is the total number of samples in the source domain ($\mathcal{S}$) and $n_t$ is the total number of samples in the target domain ($\mathcal{T}$). Our goal is to utilize the available information in both source and target domains to learn a network that lessens the effect of weather-based conditions on the detector. The proposed method contains three network modules: the detection network, the prior estimation network (PEN) and the residual feature recovery block (RFRB). Fig. 2 gives an overview of the proposed model. During source training, a source image (clean image) is passed to the detection network and the weights are learned by minimizing the detection loss, as shown in Fig. 2 with the source pipeline. For target training, a target image (weather-affected image) is forwarded through the network as shown in Fig. 2 by the target pipeline. As discussed earlier, weather-based degradations cause distortions in the feature space for the target images. In an attempt to de-distort these features, we introduce a set of residual feature recovery blocks in the target pipeline as shown in Fig. 2. This module is inspired by the residual transfer framework proposed in [long2016unsupervised] and is used to model residual features. The proposed PEN aids the detection network in adapting to the target domain by providing feedback through adversarial training using the proposed prior-adversarial loss. In the following subsections, we briefly review the backbone network, followed by a detailed discussion on the proposed prior-adversarial loss and residual feature recovery blocks.

The proposed method is explained assuming VGG [simonyan2014very] as the backbone of the detection network for simplicity. However, we would like to note that the proposed approach is not limited to VGG and can be extended to other network architectures as well (see supplementary material).

3.1 Detection Network

Following the existing domain adaptive detection approaches [Chen2018DomainAF, Shan2018, Saito2018StrongWeakDA], we base our method on the Faster-RCNN [ren2015faster] framework. Faster-RCNN is among the first end-to-end CNN-based object detection methods and uses an anchor-based strategy to perform detection and classification. For this paper, we decompose the Faster-RCNN network into three modules: the feature extractor network ($\mathcal{F}$), the region proposal network (RPN) and the region classification network (RCN). The arrangement of these modules is shown in Fig. 2 with the VGG architecture as the base network. Here, the feature extractor network consists of the first five conv blocks of VGG and the region classification network is composed of the fully connected layers of VGG. The region proposal network uses the output of the feature extractor network to generate a set of candidate object regions in a class-agnostic way. Features corresponding to these candidates are pooled from the feature extractor output and forwarded through the region classification network to obtain the object classifications and bounding box refinements. Since we have access to the source domain images and their corresponding ground truth, these networks are trained to perform detection on the source domain by minimizing the following loss function,

$\min_{\mathcal{F},\,\mathcal{G}} \;\; \mathcal{L}^{src}_{det} \qquad (1)$
$\mathcal{L}^{src}_{det} = \mathcal{L}_{rpn} + \mathcal{L}_{bbox} + \mathcal{L}_{cls} \qquad (2)$

Here, $\mathcal{G}$ represents both the region proposal and region classification networks, $\mathcal{L}_{rpn}$ denotes the region proposal loss, $\mathcal{L}_{bbox}$ denotes the bounding-box regression loss and $\mathcal{L}_{cls}$ denotes the region classification loss. The details of these individual loss components can be found in [ren2015faster].
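As a rough illustration of this decomposition, the following PyTorch sketch builds the feature extractor $\mathcal{F}$ from the VGG16 conv blocks and attaches a toy RPN head; the RCN head and the actual loss terms of Eqs. (1)-(2) are omitted, and all module names are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of the F (feature extractor) + RPN decomposition on a VGG16 backbone.
import torch
import torch.nn as nn
from torchvision.models import vgg16


class FeatureExtractor(nn.Module):
    """F: the VGG16 conv blocks (final maxpool dropped, output stride 16)."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(*list(vgg16(weights=None).features)[:-1])

    def forward(self, x):
        return self.blocks(x)  # (B, 512, H/16, W/16)


class RPNHead(nn.Module):
    """Toy RPN head: class-agnostic objectness and box deltas per anchor."""
    def __init__(self, in_ch=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 512, 3, padding=1)
        self.obj = nn.Conv2d(512, num_anchors, 1)      # objectness logits
        self.reg = nn.Conv2d(512, num_anchors * 4, 1)  # box regression deltas

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.obj(h), self.reg(h)


feat = FeatureExtractor()(torch.randn(1, 3, 600, 1200))   # shorter side 600, as in Sec. 4.1
obj_logits, box_deltas = RPNHead()(feat)
```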

3.2 Prior-adversarial Training

As discussed earlier, weather-affected images contain domain-specific information. These images typically follow mathematical models of image degradation (see Fig. 1(a), Eq. 8 and Eq. 9). We refer to this domain-specific information as a prior. A detailed discussion of the priors for haze and rain is provided in Sec. 3.2.1 and Sec. 3.2.2, respectively. We aim to exploit these priors about the weather domain to better adapt the detector to weather-affected images. To achieve this, we propose a prior-based adversarial training approach using the prior estimation network (PEN) and the prior-adversarial loss (PAL).

Let $\mathcal{P}_l$ be the PEN module introduced after the $l^{th}$ conv block of $\mathcal{F}$, and let $Z^l$ be the corresponding domain-specific prior for any image. Then the PAL for the source domain is defined as follows,

$\mathcal{L}^{src}_{pal_l} = \frac{1}{n_s H_l W_l} \sum_{i=1}^{n_s} \sum_{h=1}^{H_l} \sum_{w=1}^{W_l} \big( Z^{sl}_{ihw} - \mathcal{P}_l(\mathcal{F}_l(x^s_i))_{hw} \big)^2 \qquad (3)$

where $H_l$ and $W_l$ are the height and the width of the domain-specific prior and of the output feature $\mathcal{F}_l(\cdot)$. $Z^{sl}_i$ denotes the source image prior, scaled down from the image-level prior to match the scale at the $l^{th}$ conv block. Similarly, the PAL for the target domain images, $x^t_j$, with the corresponding prior $Z^{tl}_j$, can be defined as,

$\mathcal{L}^{tgt}_{pal_l} = \frac{1}{n_t H_l W_l} \sum_{j=1}^{n_t} \sum_{h=1}^{H_l} \sum_{w=1}^{W_l} \big( Z^{tl}_{jhw} - \mathcal{P}_l(\mathcal{F}_l(x^t_j))_{hw} \big)^2 \qquad (4)$

where we apply the PAL after the conv4 ($l$=4) and conv5 ($l$=5) blocks (as shown in Fig. 2). Hence, the final source and target adversarial losses can be given as,

$\mathcal{L}^{src}_{adv} = \mathcal{L}^{src}_{pal_4} + \mathcal{L}^{src}_{pal_5} \qquad (5)$
$\mathcal{L}^{tgt}_{adv} = \mathcal{L}^{tgt}_{pal_4} + \mathcal{L}^{tgt}_{pal_5} \qquad (6)$

The prior estimation networks ($\mathcal{P}_4$ and $\mathcal{P}_5$) predict the weather-specific prior from the features extracted from $\mathcal{F}$. However, the feature extractor network $\mathcal{F}$ is trained to fool the PEN modules by producing features that are weather-invariant (free from weather-specific priors), preventing the PEN modules from correctly estimating the weather-specific prior. Since this type of training includes prior prediction and is also reminiscent of the adversarial learning used in domain adaptation, we term this loss the prior-adversarial loss. At convergence, the feature extractor network $\mathcal{F}$ should have rid itself of any weather-specific information, and as a result both prior estimation networks $\mathcal{P}_4$ and $\mathcal{P}_5$ should not be able to correctly estimate the prior. Note that our goal at convergence is not to estimate the correct prior, but rather to learn weather-invariant features so that the detection network is able to generalize well to the target domain. This training procedure can be expressed as the following optimization,

$\max_{\mathcal{F}} \; \min_{\mathcal{P}_4,\,\mathcal{P}_5} \;\; \mathcal{L}^{src}_{adv} + \mathcal{L}^{tgt}_{adv} \qquad (7)$

Furthermore, in conventional domain adaptation, a single label is assigned to the entire target image to train the domain discriminator (see Fig. 1(b)). By doing this, it is assumed that the entire image has undergone a constant domain shift. However, this is not true for weather-affected images, where the degradations vary spatially (see Fig. 1(b)). In such cases, the assumption of constant domain shift leads to incorrect alignment, especially in the regions of minimal degradation. Incorporating the weather-specific priors overcomes this issue, as these priors are spatially varying and are directly correlated with the amount of degradation. Hence, utilizing the weather-specific prior results in better alignment.
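A minimal PyTorch sketch of this spatially varying loss is given below: a small prior estimation network regresses the downscaled weather-specific prior from intermediate features, and the per-location squared error plays the role of Eqs. (3)-(4). The PEN architecture, channel counts and interpolation choice here are illustrative assumptions (the configuration used in the paper is listed in the supplementary material).

```python
# Sketch of the prior-adversarial loss (PAL) at a single scale.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PriorEstimationNetwork(nn.Module):
    """Predicts a prior map (e.g. transmission or rain residue) from backbone features."""
    def __init__(self, in_ch=512, out_ch=1):   # out_ch=1 for transmission, 3 for RGB rain residue
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, out_ch, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, feat):
        return self.net(feat)


def prior_adversarial_loss(pen, feat, prior_full):
    """Per-location squared error between the PEN prediction and the image-level
    prior downscaled to the feature resolution (cf. Eqs. 3-4).
    feat: (B, C, h, w) backbone features; prior_full: (B, out_ch, H, W) prior."""
    prior = F.interpolate(prior_full, size=feat.shape[-2:], mode="bilinear", align_corners=False)
    return ((pen(feat) - prior) ** 2).mean()

# The PEN minimizes this loss, while the feature extractor maximizes it, e.g. by
# inserting a gradient reversal layer between the backbone and the PEN.
```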

3.2.1 Haze prior

The effect of haze on images has been extensively studied in the literature [fattal2008single, he2011single, zhang2018densely, ren2019single, ren2016single]. Most existing image dehazing methods rely on the atmospheric scattering model to represent image degradation under hazy conditions, which is defined as,

$I(z) = J(z)\,t(z) + A\,(1 - t(z)) \qquad (8)$

where $I$ is the observed hazy image, $J$ is the true scene radiance, $A$ is the global atmospheric light indicating the intensity of the ambient light, $t$ is the transmission map and $z$ is the pixel location. The transmission map is a distance-dependent factor that affects the fraction of light that reaches the camera sensor. When the atmospheric light is homogeneous, the transmission map can be expressed as $t(z) = e^{-\beta d(z)}$, where $\beta$ represents the attenuation coefficient of the atmosphere and $d(z)$ is the scene depth.

Typically, existing dehazing methods first estimate the transmission map and the atmospheric light, which are then used in Eq. (8) to recover the observed radiance or clean image. The transmission map contains important information about the haze domain, specifically representing the light attenuation factor. We use this transmission map as a domain prior for supervising the prior estimation network (PEN) while adapting to hazy conditions. Furthermore, instead of depending on the actual ground-truth transmission maps, we use the dark channel prior [he2011single] to estimate the transmission maps. Hence, no additional human annotation effort is required for obtaining the haze prior.
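For reference, a minimal NumPy/SciPy sketch of estimating the haze prior with the dark channel prior is shown below; the patch size, omega and the top-percentile rule for atmospheric light are common defaults assumed here, not the exact settings used in the paper.

```python
# Transmission-map (haze prior) estimation via the dark channel prior.
import numpy as np
from scipy.ndimage import minimum_filter


def dark_channel(img, patch=15):
    """Per-pixel minimum over RGB followed by a local minimum filter (img in [0, 1], HxWx3)."""
    return minimum_filter(img.min(axis=2), size=patch)


def estimate_transmission(hazy, patch=15, omega=0.95, top_percent=0.001):
    """Haze prior t(z) ~ 1 - omega * dark_channel(I / A), clipped away from zero."""
    dc = dark_channel(hazy, patch)
    # Atmospheric light A: mean color of the brightest pixels in the dark channel.
    n_top = max(1, int(top_percent * dc.size))
    idx = np.argpartition(dc.ravel(), -n_top)[-n_top:]
    A = hazy.reshape(-1, 3)[idx].mean(axis=0) + 1e-6
    return np.clip(1.0 - omega * dark_channel(hazy / A, patch), 0.05, 1.0)
```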

3.2.2 Rain prior

Similar to dehazing, image deraining methods [Authors16, Authors18, Authors17e, liu2018d3r, yang2019joint, li2019single, li2019singlecomprehensive, zhu2017joint, hu2019depth] also assume a mathematical model to represent the degradation process, defined as follows,

$I(z) = J(z) + R(z) \qquad (9)$

where $I$ is the observed rainy image, $J$ is the desired clean image, and $R$ is the rain residue. This formulation models the rainy image as a superposition of the clean background image and the rain residue. The rain residue contains domain-specific information about the rain for a particular image and hence can be used as a domain-specific prior for supervising the prior estimation network (PEN) while adapting to rainy conditions. Similar to haze, we do not rely on the actual ground-truth rain residue. Instead, we estimate the rain residue using the rain layer prior described in [Authors16], thereby avoiding expensive human annotation for obtaining the rain prior.

In both cases discussed above (haze prior and rain prior), we do not use any ground-truth labels to estimate the respective priors. Hence, our overall approach still falls into the category of unsupervised adaptation. Furthermore, these priors can be pre-computed for the training images to reduce the computational overhead during the learning process. Additionally, the prior computation is not required during inference, and hence the proposed adaptation method does not result in any computational overhead at test time.
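One way to realize this pre-computation is sketched below: each training image's prior is estimated once and cached at the feature resolutions used by the PAL. The file layout, the estimate_prior callable and the assumed conv4/conv5 strides are illustrative, not the authors' pipeline.

```python
# Offline pre-computation of weather-specific priors for the training set.
import os
import numpy as np
import torch
import torch.nn.functional as F


def precompute_priors(image_paths, estimate_prior, out_dir, strides=(8, 16)):
    """estimate_prior(path) -> (H, W) or (H, W, C) prior in [0, 1]
    (e.g. a transmission map or rain residue). Cached copies are downscaled to
    the assumed conv4/conv5 feature strides."""
    os.makedirs(out_dir, exist_ok=True)
    for path in image_paths:
        prior = np.atleast_3d(estimate_prior(path)).astype(np.float32)
        t = torch.from_numpy(prior).permute(2, 0, 1)[None]          # (1, C, H, W)
        cache = {s: F.interpolate(t, size=(t.shape[-2] // s, t.shape[-1] // s),
                                  mode="bilinear", align_corners=False)
                 for s in strides}
        name = os.path.splitext(os.path.basename(path))[0]
        torch.save(cache, os.path.join(out_dir, name + "_prior.pt"))
```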

3.3 Residual Feature Recovery Block

As discussed earlier, weather degradations introduce distortions in the feature space. To aid the de-distortion process, we introduce a set of residual feature recovery blocks (RFRBs) in the target feed-forward pipeline, inspired by the residual transfer network method proposed in [long2016unsupervised]. Let $\Delta\mathcal{F}_l$ be the residual feature recovery block at the $l^{th}$ conv block. The target domain feed-forward is modified to include the residual feature recovery block. The feed-forward equation at the $l^{th}$ conv block can be written as,

$\hat{\mathcal{F}}_l(x^t) = \mathcal{F}_l(x^t) + \Delta\mathcal{F}_l\big(\mathcal{F}_{l-1}(x^t)\big) \qquad (10)$

where $\mathcal{F}_l(x^t)$ indicates the feature extracted from the $l^{th}$ conv block for any image $x^t$ sampled from the target domain using the feature extractor network $\mathcal{F}$, $\Delta\mathcal{F}_l(\cdot)$ indicates the residual features extracted from the output of the previous conv block, and $\hat{\mathcal{F}}_l(x^t)$ indicates the feature extracted from the $l^{th}$ conv block with the RFRB-modified feed-forward. The RFRB modules are also illustrated in the target feed-forward pipeline of Fig. 2; they have no effect on the source feed-forward pipeline. In our case, we utilize RFRBs at both the conv4 ($l$=4) and conv5 ($l$=5) blocks. Additionally, the effect of the residual features is regularized by enforcing norm constraints on them. The regularization loss for the RFRBs $\Delta\mathcal{F}_4$ and $\Delta\mathcal{F}_5$ is defined as,

$\mathcal{L}_{reg} = \frac{1}{n_t} \sum_{j=1}^{n_t} \Big( \big\|\Delta\mathcal{F}_4(x^t_j)\big\|_1 + \big\|\Delta\mathcal{F}_5(x^t_j)\big\|_1 \Big) \qquad (11)$
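The sketch below illustrates Eqs. (10)-(11) in PyTorch: a small residual branch produces the correction that is added to the target-domain features, and its norm is penalized. The exact RFRB layer configuration is given in the supplementary tables; the branch below and the helper names are simplified assumptions.

```python
# Sketch of an RFRB and the RFRB-modified target feed-forward (Eq. 10) with
# the residual-norm regularization (Eq. 11).
import torch
import torch.nn as nn


class RFRB(nn.Module):
    """Residual feature recovery block: predicts residual features from the
    previous conv block's output to de-distort target-domain features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.MaxPool2d(2, 2),
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )

    def forward(self, feat_prev):
        return self.net(feat_prev)


def target_forward(conv_block, rfrb, feat_prev):
    """Eq. 10: target features = conv block output + RFRB residual (source images
    skip the RFRB branch). Assumes conv_block downsamples by 2, matching the
    RFRB's initial maxpool."""
    residual = rfrb(feat_prev)
    feat = conv_block(feat_prev) + residual
    reg = residual.abs().mean()   # L1-style penalty on the residual (cf. Eq. 11)
    return feat, reg
```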

3.4 Overall Loss

The overall loss for training the network is defined as,

$\min_{\mathcal{F},\,\mathcal{G},\,\Delta\mathcal{F}} \; \max_{\mathcal{P}} \;\; \mathcal{L}^{src}_{det} + \lambda\,\mathcal{L}_{reg} - \mathcal{L}_{adv} \qquad (12)$
$\mathcal{L}_{adv} = \mathcal{L}^{src}_{adv} + \mathcal{L}^{tgt}_{adv} \qquad (13)$

Here, $\mathcal{F}$ represents the feature extractor network, $\mathcal{P}$ denotes both prior estimation networks employed after the conv4 and conv5 blocks, i.e., $\mathcal{P} = \{\mathcal{P}_4, \mathcal{P}_5\}$, and $\Delta\mathcal{F} = \{\Delta\mathcal{F}_4, \Delta\mathcal{F}_5\}$ represents the RFRBs at both the conv4 and conv5 blocks. Also, $\mathcal{L}^{src}_{det}$ is the source detection loss, $\mathcal{L}_{reg}$ is the regularization loss weighted by $\lambda$, and $\mathcal{L}_{adv}$ is the overall adversarial loss used for prior-based adversarial training.
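For concreteness, one joint training iteration combining these terms could look like the following sketch. The model interface (detection_loss, pal_losses, rfrb_reg) is a hypothetical wrapper around the components defined above, not an API from the paper.

```python
# One training iteration combining Eqs. (12)-(13): source detection loss,
# prior-adversarial losses at both scales (gradient reversal assumed inside
# pal_losses), and the RFRB regularization on the target pipeline.
import torch


def train_step(batch_src, batch_tgt, model, optimizer, lam=0.1):
    optimizer.zero_grad()

    # Source pipeline: detection loss + PAL against the source priors.
    loss = model.detection_loss(batch_src["images"], batch_src["targets"])
    loss = loss + sum(model.pal_losses(batch_src["images"], batch_src["priors"], domain="source"))

    # Target pipeline: PAL against the target priors + residual-norm regularization.
    loss = loss + sum(model.pal_losses(batch_tgt["images"], batch_tgt["priors"], domain="target"))
    loss = loss + lam * model.rfrb_reg(batch_tgt["images"])

    loss.backward()
    optimizer.step()
    return float(loss.detach())
```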

4 Experiments and Results

4.1 Implementation details

We follow the training protocol of [Saito2018StrongWeakDA, Chen2018DomainAF] for training the Faster-RCNN network. The backbone network for all experiments is the VGG16 network [simonyan2014very]. We model the residuals using RFRBs for the convolution blocks C4 and C5 of the VGG16 network. The PAL is applied only to these conv blocks modeled with RFRBs. The PAL is designed based on the adaptation setting (haze or rain). The parameters of the first two conv blocks are frozen, similar to [Saito2018StrongWeakDA, Chen2018DomainAF]. The detailed network architectures of the RFRBs, PEN and the discriminator are provided in the supplementary material. During training, we set the shorter side of the image to 600 and use ROI alignment. We train all networks for 70K iterations. For the first 50K iterations, the learning rate is set to 0.001 and for the last 20K iterations it is set to 0.0001. We report the performance of the trained model after 70K iterations. We set $\lambda$ equal to 0.1 for all experiments.
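The schedule above can be expressed, for example, with a step-wise learning-rate drop as in the sketch below; the SGD momentum and weight decay values are assumptions, and only the learning rates, iteration counts and $\lambda$ come from the text.

```python
# Optimizer and learning-rate schedule matching the stated protocol:
# 70K iterations, lr 1e-3 for the first 50K, 1e-4 afterwards, lambda = 0.1.
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3)   # placeholder standing in for the full detection + PEN + RFRB model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50_000], gamma=0.1)
lam = 0.1                      # weight on the RFRB regularization loss

for it in range(70_000):
    # ... one adaptation step as sketched in Sec. 3.4 ...
    optimizer.step()
    scheduler.step()
```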

In addition to comparisons with recent methods, we also perform an ablation study where we evaluate the following configurations to analyze the effectiveness of different components in the network. Note that we progressively add components, which enables us to gauge the performance improvement obtained by each of them:


  • FRCNN: Source only baseline experiment where Faster-RCNN is trained on the source dataset.

  • FRCNN+D: Domain adaptation baseline experiment consisting of Faster-RCNN with domain discriminator after conv5 supervised by the domain adversarial loss.

  • FRCNN+D+R: Starting with FRCNN+D as the base configuration, we add an RFRB block after conv4 in the Faster-RCNN. This experiment enables us to understand the contribution of the RFRB block.

  • FRCNN+P+R (single scale): We start with the FRCNN+D+R configuration and replace the domain discriminator and domain adversarial loss with the prior estimation network (PEN) and the prior-adversarial loss (PAL). With this experiment, we show the importance of training with the proposed prior-adversarial loss.

  • FRCNN+P+R (multi-scale): Finally, we perform the prior-based feature alignment at two scales: conv4 and conv5. Starting with the FRCNN+P+R (single scale) configuration, we add an RFRB block after conv3 and a PEN module after conv4. This experiment corresponds to the configuration depicted in Fig. 2 and demonstrates the efficacy of the overall method, in addition to establishing the importance of aligning features at multiple levels in the network.

Following the protocol set by the existing methods [Chen2018DomainAF, Shan2018, Saito2018StrongWeakDA], we use mean average precision (mAP) scores for performance comparison.

4.2 Adaptation to hazy conditions

In this section, we present the results of adaptation to hazy conditions on the following benchmarks: (i) Cityscapes → Foggy-Cityscapes [Sakaridis2018SemanticFS], (ii) Cityscapes → RTTS [li2019benchmarking], and (iii) WIDER [yang2016wider] → UFDD-Haze [nada2018pushing]. In the first two experiments, we consider Cityscapes [cordts2016cityscapes] as the source domain. Note that the Cityscapes dataset contains images captured under clear weather conditions.

Cityscapes → Foggy-Cityscapes: In this experiment, we adapt from Cityscapes to Foggy-Cityscapes [Sakaridis2018SemanticFS]. The Foggy-Cityscapes dataset was recently proposed in [Sakaridis2018SemanticFS] to study detection algorithms under hazy weather conditions. It is derived from the Cityscapes dataset by simulating fog on the clear-weather images of Cityscapes. Both Cityscapes and Foggy-Cityscapes have the same categories, which include car, truck, motorcycle/bike, train, bus, rider and person. Similar to [Chen2018DomainAF, Saito2018StrongWeakDA], we utilize 2975 images of both Cityscapes and Foggy-Cityscapes for training. Note that we use annotations only from the source dataset (Cityscapes) for training the detection pipeline. For evaluation, we consider a non-overlapping validation set of 500 images provided by the Foggy-Cityscapes dataset.

We compare the proposed method with the following approaches: DA-Faster [Chen2018DomainAF], SWDA [Saito2018StrongWeakDA], DiversifyMatch [kim2019diversify], Mean Teacher with Object Relations (MTOR) [cai2019exploring], Selective Cross-Domain Alignment (SCDA) [zhu2019adapting] and Noisy Labeling [khodabandeh2019robust]. The corresponding results are presented in Table 1. It can be clearly observed that the proposed method outperforms other methods in the overall scores while achieving the best performance in most of the classes.

Method prsn rider car truc bus train bike bcycle mAP
DAFaster [Chen2018DomainAF] (CVPR18) 25.0 31.0 40.5 22.1 35.3 20.2 20.0 27.1 27.6
SCDA [zhu2019adapting] (CVPR19) 33.5 38.0 48.5 26.5 39.0 23.3 28.0 33.6 33.8
SWDA [Saito2018StrongWeakDA] (CVPR19) 29.9 42.3 43.5 24.5 36.2 32.6 30.0 35.3 34.3
DM [kim2019diversify] (CVPR19) 30.8 40.5 44.3 27.2 38.4 34.5 28.4 32.2 34.6
MTOR [cai2019exploring] (CVPR19) 30.6 41.4 44.0 21.9 38.6 40.6 28.3 35.6 35.1
NL [khodabandeh2019robust] (ICCV19) 35.1 42.1 49.2 30.1 45.3 26.9 26.8 36.0 36.5
Ours FRCNN [ren2015faster] 25.8 33.7 35.2 13.0 28.2 9.1 18.7 31.4 24.4
FRCNN+D 30.9 38.5 44.0 19.6 32.9 17.9 24.1 32.4 30.0
FRCNN+D+R 32.8 44.7 49.9 22.3 31.7 17.3 26.9 37.5 32.9
FRCNN+P+R (single scale) 33.4 42.8 50.0 24.2 40.8 30.4 33.1 37.5 36.5
FRCNN+P+R (multi-scale) 36.4 47.3 51.7 22.8 47.6 34.1 36.0 38.7 39.3
Table 1: Performance comparison for the Cityscapes → Foggy-Cityscapes experiment.


Figure 3: Visualization of features using t-SNE plots of different models for Foggy-Cityscapes. (a) Model trained using only the domain adaptive loss. (b) Model trained using the prior-adversarial loss. With the domain adaptive loss, the features are not perfectly aligned; introducing PAL results in better alignment.
Figure 4: Detection results on Foggy-Cityscapes. (a) DA-Faster RCNN [Chen2018DomainAF]. (b) Proposed method. The bounding boxes are colored based on the detector confidence using the color map shown. DA-Faster-RCNN produces detections with low confidence in addition to missing the truck class in both samples. In contrast, the proposed method outputs high-confidence detections without missing any objects.

Additionally, we present the results of the baseline experiments discussed in Sec. 4.1. It can be observed from Table 1 that the performance of source-only training of Faster-RCNN is generally poor under hazy conditions. The use of simple domain adaptation [ganin2014unsupervised] (FRCNN+D) improves over the source-only performance. The addition of an RFRB (FRCNN+D+R) results in further improvements, indicating the importance of the RFRB blocks. However, as shown in the t-SNE visualization (see Fig. 3(a)), the source and target features are not aligned perfectly. This is because the conventional domain adaptation loss assumes a constant domain shift across the entire image, resulting in incorrect alignment. The use of the prior-adversarial loss (FRCNN+P+R, single scale) overcomes this issue. It can be seen from Fig. 3(b) that the features are better aligned with PAL as compared to the regular domain adaptation loss. Furthermore, we achieve an improvement of approximately 3.6% in the overall mAP score, demonstrating the effectiveness of the proposed prior-adversarial training. Note that the FRCNN+P+R (single scale) baseline achieves performance comparable to the state-of-the-art. Finally, by performing prior-adversarial adaptation at an additional scale (FRCNN+P+R, multi-scale), we achieve further improvements, surpassing the existing state-of-the-art baseline [khodabandeh2019robust] by 2.8%.

Fig. 4 shows sample qualitative detection results corresponding to the images from Foggy-Cityscapes. Results for the proposed method are compared with DA-Faster-RCNN [Chen2018DomainAF]. It can be observed that the proposed method is able to generate comparatively high quality detections.

Cityscapes → RTTS: In this experiment, we adapt from Cityscapes to the RTTS dataset [li2019benchmarking]. RTTS is a subset of the larger RESIDE dataset [li2019benchmarking], and it contains 4,807 unannotated and 4,322 annotated real-world hazy images covering mostly traffic and driving scenarios. We use the 4,807 unannotated images for training the domain adaptation process. The evaluation is performed on the 4,322 annotated images. RTTS has a total of five categories, namely motorcycle/bike, person, bicycle, bus and car. It is the largest available dataset for object detection under real-world hazy conditions.

In Table 2, the results of the proposed method are compared with Faster-RCNN [ren2015faster], DA-Faster [Chen2018DomainAF] and SWDA [Saito2018StrongWeakDA]. It can be observed that the proposed method achieves an improvement of 3.1% over the baseline Faster-RCNN (source-only training), while outperforming the other recent methods.

Method prsn car bus bike bcycle mAP
FRCNN [ren2015faster] 46.6 39.8 11.7 19.0 37.0 30.9
DAFaster [Chen2018DomainAF] (CVPR18) 37.7 48.0 14.0 27.9 36.0 32.8
SWDA [Saito2018StrongWeakDA] (CVPR19) 42.0 46.9 15.8 25.3 37.8 33.5
Proposed 37.4 54.7 17.2 22.5 38.5 34.1
Table 2: Performance comparison for the Cityscapes → RTTS experiment.

WIDER-Face → UFDD-Haze: Recently, Nada et al. [nada2018pushing] published a benchmark face detection dataset which consists of real-world images captured under different weather-based conditions such as haze and rain. Specifically, this dataset consists of 442 images under the haze category. Since face detection is closely related to the task of object detection, we evaluate our framework by adapting from the WIDER-Face [yang2016wider] dataset to UFDD-Haze. WIDER-Face is a large-scale face detection dataset with approximately 32,000 images and 199K face annotations. The results corresponding to this adaptation experiment are shown in Table 3. It can be observed from this table that the proposed method achieves better performance as compared to the other methods.

Method UFDD-Haze UFDD-Rain
FRCNN [ren2015faster] 46.4 54.8
DAFaster [Chen2018DomainAF] (CVPR18) 52.1 58.2
SWDA [Saito2018StrongWeakDA] (CVPR19) 55.5 60.0
Proposed 58.5 62.1
Table 3: Results (mAP) of the adaptation experiments from WIDER-Face to UFDD Haze and Rain.

4.3 Adaptation to rainy conditions

In this section, we present the results of adaptation to rainy conditions. Due to the lack of appropriate datasets for this setting, we create a new rainy dataset called Rainy-Cityscapes, derived from Cityscapes. It has the same number of images for training and validation as Foggy-Cityscapes. First, we present the simulation process used to create the dataset, followed by the evaluation and comparison of the proposed method with other methods on this new dataset.

Rainy-Cityscapes: Similar to Foggy-Cityscapes, we use a subset of 3475 images from Cityscapes to create the synthetic rain dataset. Using [synth-rain], several masks containing artificial rain streaks are synthesized. The rain streaks are created using different Gaussian noise levels and multiple rotation angles. Next, for every image in the subset of the Cityscapes dataset, we pick a random rain mask and blend it onto the image to generate the synthetic rainy image. More details and example images are provided in the supplementary material.
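A rough sketch of this kind of rain-mask synthesis and blending is shown below; the noise level, streak length, angle range and blend weight are illustrative assumptions and not the exact settings used to build Rainy-Cityscapes.

```python
# Synthesizing a rain streak mask (Gaussian noise + oriented motion blur) and
# blending it onto a clean image, in the spirit of Eq. (9).
import numpy as np
import cv2


def rain_streak_mask(h, w, noise_std=0.3, length=15, angle_deg=80.0, thresh=1.0):
    """Gaussian noise -> threshold -> motion blur along angle_deg to form streaks."""
    noise = np.random.randn(h, w) * noise_std + 0.5
    seeds = (noise > thresh).astype(np.float32)
    kernel = np.zeros((length, length), np.float32)
    kernel[length // 2, :] = 1.0                                  # horizontal line
    rot = cv2.getRotationMatrix2D((length / 2, length / 2), angle_deg, 1.0)
    kernel = cv2.warpAffine(kernel, rot, (length, length))
    kernel /= kernel.sum() + 1e-6
    return np.clip(cv2.filter2D(seeds, -1, kernel), 0.0, 1.0)


def add_rain(clean_bgr, alpha=0.8):
    """Blend a randomly oriented rain mask onto a clean image."""
    h, w = clean_bgr.shape[:2]
    mask = rain_streak_mask(h, w, angle_deg=np.random.uniform(60, 120))
    rainy = clean_bgr.astype(np.float32) / 255.0 + alpha * mask[..., None]
    return (np.clip(rainy, 0.0, 1.0) * 255).astype(np.uint8)
```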

Cityscapes → Rainy-Cityscapes: In this experiment, we adapt from Cityscapes to Rainy-Cityscapes. We compare the proposed method with recent methods such as DA-Faster [Chen2018DomainAF] and SWDA [Saito2018StrongWeakDA]. It can be observed from Table 4 that the proposed method outperforms the other methods by a significant margin. Additionally, we also present the results of the ablation study consisting of the experiments listed in Sec. 4.1. The introduction of the domain adaptation loss significantly improves over the source-only Faster-RCNN baseline, resulting in approximately 9% improvement for the FRCNN+D baseline in Table 4. This performance is further improved by 1% with the help of the residual feature recovery blocks, as shown by the FRCNN+D+R baseline. When domain adversarial training is replaced with prior-adversarial training using PAL, i.e. the FRCNN+P+R (single scale) baseline, we observe a 2.5% improvement, showing the effectiveness of the proposed training methodology. Finally, by performing prior-adversarial training at multiple scales, the proposed method FRCNN+P+R (multi-scale) gains approximately another 2% and also outperforms the next best method, SWDA [Saito2018StrongWeakDA], by 1.6%.

Fig. 5 illustrates sample detection results obtained using the proposed method as compared to a recent method [Chen2018DomainAF]. The proposed method achieves superior quality detections.

Method prsn rider car truc bus train bike bcycle mAP
DAFaster [Chen2018DomainAF] (CVPR18) 26.9 28.1 50.6 23.2 39.3 4.7 17.1 20.2 26.3
SWDA [Saito2018StrongWeakDA] (CVPR19) 29.6 38.0 52.1 27.9 49.8 28.7 24.1 25.4 34.5
Ours FRCNN 21.6 19.5 38.0 12.6 30.1 24.1 12.9 15.4 21.8
FRCNN+D 29.1 34.8 52.0 22.0 41.8 20.4 18.1 23.3 30.2
FRCNN+D+R 28.8 33.1 51.7 22.3 41.8 24.9 22.2 24.6 31.2
FRCNN+P+R (single scale) 29.7 34.3 52.5 23.6 47.9 32.5 24.0 25.5 33.8
FRCNN+P+R (multi-scale) 31.3 34.8 57.8 29.3 48.6 34.4 25.4 27.3 36.1
Table 4: Performance comparison for the Cityscapes → Rainy-Cityscapes experiment.
Figure 5: Detection results on Rainy-Cityscapes. (a) DA-Faster RCNN [Chen2018DomainAF]. (b) Proposed method. The bounding boxes are colored based on the detector confidence using the color map as shown. DA-Faster-RCNN misses several objects in both the samples. In contrast, the proposed method is able to output high confidence detections without missing any objects.

WIDER-Face → UFDD-Rain: In this experiment, we adapt from WIDER-Face to UFDD-Rain [nada2018pushing]. The UFDD-Rain dataset consists of 628 images collected under rainy conditions. The results of the proposed method as compared to the other methods are shown in Table 3. It can be observed that the proposed method outperforms source-only training by approximately 7.3%, while achieving the best results among the recent methods.

Due to space constraints, we provide additional details about the proposed method, including further results and analysis, in the supplementary material.

5 Conclusion

We addressed the problem of adapting object detectors to adverse weather conditions. Based on the observation that weather conditions cause degradations that can be mathematically modeled and that produce spatially varying distortions in the feature space, we proposed a novel prior-adversarial loss that aims at producing weather-invariant features. Additionally, a set of residual feature recovery blocks is introduced to learn residual features that can efficiently aid the adaptation process. The proposed framework is evaluated on several benchmark datasets such as Foggy-Cityscapes, RTTS and UFDD. Through extensive experiments, we show that our method achieves significant gains over recent methods on all the datasets.

References

6 Supplementary Material

This document contains the supplementary material for the paper "Prior-based Domain Adaptive Object Detection for Adverse Weather Conditions". Due to space limitations in the submitted paper, we provide additional details here, including network configurations, details of the newly introduced Rainy-Cityscapes dataset, and additional analysis and discussion of the results.

6.1 Additional Results

Results with ResNet-152

Table 5 shows additional results on the Cityscapes → Foggy-Cityscapes experiment when the ResNet-152 architecture is used as the backbone of the detection network. From the results we can see that ResNet-152 performs better than the corresponding VGG16 baselines. For the FRCNN+P+R (multi-scale) baseline with ResNet-152, the residual feature recovery blocks and prior estimation networks are applied on the fourth and fifth conv blocks of the network. The results in Table 5 show that the proposed approach generalizes well to different network architectures.

Method prsn rider car truc bus train bike bcycle mAP
DAFaster 25.0 31.0 40.5 22.1 35.3 20.2 20.0 27.1 27.6
SCDA 33.5 38.0 48.5 26.5 39.0 23.3 28.0 33.6 33.8
SWDA 29.9 42.3 43.5 24.5 36.2 32.6 30.0 35.3 34.3
DM 30.8 40.5 44.3 27.2 38.4 34.5 28.4 32.2 34.6
MTOR 30.6 41.4 44.0 21.9 38.6 40.6 28.3 35.6 35.1
NL 35.1 42.1 49.2 30.1 45.3 26.9 26.8 36.0 36.5
Ours (VGG16) FRCNN 25.8 33.7 35.2 13.0 28.2 9.1 18.7 31.4 24.4
FRCNN+D 30.9 38.5 44.0 19.6 32.9 17.9 24.1 32.4 30.0
FRCNN+D+R 32.8 44.7 49.9 22.3 31.7 17.3 26.9 37.5 32.9
FRCNN+P+R (single scale) 33.4 42.8 50.0 24.2 40.8 30.4 33.1 37.5 36.5
FRCNN+P+R (multi-scale) 36.4 47.3 51.7 22.8 47.6 34.1 36.0 38.7 39.3
Ours (ResNet-152) FRCNN 32.4 42.2 36.0 19.8 26.4 4.7 22.7 32.6 27.1
FRCNN+P+R (multi-scale) 34.9 46.4 51.4 29.2 46.3 43.2 31.7 37.0 40.0
Table 5: Performance comparison for the Cityscapes → Foggy-Cityscapes experiment with different backbones.
Figure 6: Performance sensitivity of the proposed approach with varying regularization parameter $\lambda$.

Parameter Sensitivity

In Fig. 6, we provide the sensitivity of the proposed approach with respect to the parameter $\lambda$, which controls the effect of the regularization applied to the residual feature norm from the residual feature recovery blocks. The parameter sensitivity experiment was performed for adaptation from Cityscapes → Foggy-Cityscapes with the VGG16 architecture as the backbone of the detection network.

6.2 Rainy-Cityscapes

Fig. 7 shows sample images from the Rainy-Cityscapes dataset introduced in the paper.

Figure 7: Sample images from the Rainy-Cityscapes dataset.

6.3 Network configurations

The network configuration details of the Prior Estimation Network (PEN) and the Residual Feature Recovery Blocks (RFRB) are shown in Tables 6 and 7, respectively.

Prior Estimation Network
Gradient Reversal Layer
Conv, 1 × 1, 64, stride 1, BN, ReLU
Conv, 3 × 3, 64, stride 1, BN, ReLU
Conv, 3 × 3, 64, stride 1, BN, ReLU
Conv, 3 × 3, 3, stride 1, Tanh
Table 6: Network configuration details for the Prior Estimation Networks.
Residual Feature Recovery Block - Conv4 | Residual Feature Recovery Block - Conv5
Maxpool, 2 × 2, stride 2 | Maxpool, 2 × 2, stride 2
Conv, 3 × 3, 256, stride 1, padding 1, ReLU | Conv, 3 × 3, 512, stride 1, padding 1, ReLU
Conv, 3 × 3, 512, stride 1, padding 1, ReLU | Conv, 3 × 3, 512, stride 1, padding 1, ReLU
Conv, 3 × 3, 512, stride 1, padding 1 | Conv, 3 × 3, 512, stride 1, padding 1
Table 7: Network configuration details for the Residual Feature Recovery Blocks.
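For convenience, the two configurations above can be written out as PyTorch modules as in the sketch below; the input channel counts and the padding of the 3 × 3 convolutions are assumptions based on the VGG16 backbone, and the gradient reversal layer preceding the PEN is omitted.

```python
# PEN (Table 6) and RFRB (Table 7) configurations rendered as PyTorch modules.
import torch.nn as nn


def make_pen(in_ch=512, out_ch=3):
    """Prior Estimation Network as listed in Table 6."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 1, stride=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        nn.Conv2d(64, 64, 3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        nn.Conv2d(64, out_ch, 3, stride=1, padding=1), nn.Tanh(),
    )


def make_rfrb(in_ch, mid_ch):
    """Residual Feature Recovery Block as listed in Table 7
    (mid_ch = 256 for the conv4 block, 512 for the conv5 block)."""
    return nn.Sequential(
        nn.MaxPool2d(2, stride=2),
        nn.Conv2d(in_ch, mid_ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, 512, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, 3, stride=1, padding=1),
    )


rfrb_conv4 = make_rfrb(in_ch=256, mid_ch=256)   # assumed to operate on the conv3 output
rfrb_conv5 = make_rfrb(in_ch=512, mid_ch=512)   # assumed to operate on the conv4 output
```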

6.4 Qualitative Results

Cityscapes → Foggy-Cityscapes

Fig. 8 shows sample results on Foggy-Cityscapes.

Figure 8: Detection results on Foggy-Cityscapes. (a) DA-Faster RCNN [Chen2018DomainAF] (b) Proposed method. The bounding boxes are colored based on the detector confidence using the color map as shown. As we can see from the figures, the proposed method is able to produce high confidence predictions and is able to detect more objects in the image.

Cityscapes → Rainy-Cityscapes

Fig. 9 shows sample results on Rainy-Cityscapes.

Figure 9: Detection results on Rainy-Cityscapes. (a) DA-Faster RCNN [Chen2018DomainAF] (b) Proposed method. The bounding boxes are colored based on the detector confidence using the color map as shown. As we can see from the figures, the proposed method is able to produce high confidence predictions and is able to detect more objects in the image.