To See in the Dark: N2DGAN for Background Modeling in Nighttime Scene

12/12/2019 ∙ by Zhenfeng Zhu, et al. ∙ 10

Because of insufficient and uneven lighting, nighttime images have lower contrast and higher noise than daytime images of the same scene, which seriously limits the performance of conventional background modeling methods. For this challenging problem of background modeling in nighttime scenes, this paper proposes a solution that differs from existing approaches. To make background modeling in nighttime scenes perform as well as in daytime conditions, we put forward a generation-based background modeling framework for foreground surveillance. With a pre-specified daytime reference image as the background frame, a GAN-based generation model, called N2DGAN, is trained to transfer each frame of a nighttime video to a virtual daytime image that shows the same scene as the reference image except for the foreground region. Specifically, to balance the preservation of the background scene and the foreground object(s) when generating the virtual daytime image, we present a two-pathway generation model in which global and local sub-networks are combined under spatial and temporal consistency constraints. For the sequence of generated virtual daytime images, a multi-scale Bayes model is further proposed to characterize the temporal variation of the background. We evaluate on collected datasets with manually labeled ground truth, which provide a valuable resource for the related research community. The results in both the main paper and the supplementary material show the efficacy of the proposed approach.




I Introduction

Background modeling is fundamental to numerous applications in computer vision, especially video surveillance [1, 2, 3, 4].

Figure 1: Flowchart of three kinds of background modeling methods for foreground object detection in nighttime surveillance video. (a) Conventional background modeling. (b) Enhancement based background modeling. (c) Generation based background modeling.

In the last decades, state-of-the-art approaches for background modeling have been proposed for visual surveillance under daytime scenes. On the whole, they are dominated by a family of statistical methods. Among them, GMM [4] is the best-known approach; it characterizes each background pixel by a mixture of K Gaussians. As a density estimation method, KDE [2] uses kernel density estimation to describe the temporal distribution of each pixel. Codebook [3] and ViBe [1] are two other representative methods that achieve good performance in modeling the background.

However, all these conventional methods face a considerable challenge under insufficient and uneven lighting at night, especially in the presence of dynamic backgrounds, lighting changes, and extreme weather conditions such as rain, snow and fog. As we can see from Fig.1 (a), conventional background modeling methods such as GMM fail to distinguish the foreground object from the background because of the insufficient illumination. To deal with such a case, an intuitive approach, shown in Fig.1 (b), is to first perform image enhancement and then build the background model on the basis of the enhanced frames, just as in background modeling under daytime scenes. However, since these enhancement methods [5, 6, 7] are not task-driven, they usually lose sight of inter-frame consistency. Thus, it is difficult for them to establish a unified background model under nighttime scenes.

In addition, during manuscript preparation, we also noticed several recently proposed deep learning based works [36, 37, 38] for background modeling. In contrast to our fully unsupervised model, all of them are supervised and require a large amount of manually labeled data for model training. These models therefore lack scalability to scenes unknown beforehand, which also means that their performance depends greatly on the collected training dataset. For this reason, they are not taken as baseline approaches for comparison.

To address this issue, we integrate a generative model into background modeling. Fig.1 (c) shows the proposed generation-based background modeling framework. With a pre-specified daytime reference image as the ground-truth background frame, the generation model is trained to transfer each frame of a nighttime video to a virtual daytime image showing the same scene as the reference image except for the foreground region. Furthermore, on the basis of the generated sequence of virtual daytime images, the background model can be built and the foreground object detected. In fact, the unique reference image plays a significant role in enforcing pixel-wise temporal consistency across frames in the generation of virtual daytime images.

To the best of our knowledge, this paper is one of the first attempts to introduce GANs based deep learning network for background modeling. In summary, the following points highlight several contributions of the paper:

  • This paper proposes a reasonable and innovative solution to the longstanding problem of background subtraction. To make the background model under nighttime scenes work as well as under daytime conditions, a generation-based background modeling framework, N2DGAN, is proposed by applying state-of-the-art GAN advances.

  • To simultaneously preserve background scene and the foreground object(s) in generating the virtual daytime image, we present a two-pathway generation model, in which the global and local sub-networks are seamlessly combined with spatial and temporal consistency constraints.

  • For the sequence of generated virtual daytime images, a multi-scale Bayes model is proposed to characterize pertinently the temporal variation of background. Thus, while suppressing effectively noise coming from virtual daytime image generation, we can ensure the favorable detection of foreground objects.

  • We collect a benchmark dataset including indoor and outdoor scenes with manually labeled ground truth, which can serve as a good benchmark for the research community.

II Related Work

Nighttime Image Enhancement. Here we simply divide image enhancement methods into two categories: reference based and non-reference based methods.

Non-reference based methods mainly focus on how to improve low contrast images. A naive method, Histogram Equalization [6], spreads out the most frequent intensity values, thus gaining a higher contrast for the areas of lower contrast. However, it pays less attention to the shape of input histogram and shows bad performance in uneven illumination images. The purpose of Retinex based image enhancement [5] is to estimate the illumination from original image, thereby decomposing reflectance image and eliminating the influence of uneven illumination to improve the visual effect of the original image.
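To make the histogram equalization step concrete, the following minimal sketch applies the classical CDF-based remapping to a tiny 8-bit grayscale image stored as nested lists. The function name `equalize_histogram` and the toy image are illustrative assumptions and are not taken from [6].

```python
def equalize_histogram(image, levels=256):
    """Classical histogram equalization for a 2-D list of integer gray levels."""
    pixels = [p for row in image for p in row]
    total = len(pixels)

    # Histogram of gray levels.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution function, kept as running counts.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    # Smallest non-zero CDF value anchors the darkest occupied level at 0.
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Constant image: there is nothing to spread out.
        return [row[:] for row in image]

    scale = (levels - 1) / (total - cdf_min)
    lut = [round((c - cdf_min) * scale) if c > 0 else 0 for c in cdf]
    return [[lut[p] for p in row] for row in image]


# Toy low-contrast image (hypothetical values clustered in a dark range).
dark = [[10, 10, 12], [12, 14, 14], [10, 12, 14]]
bright = equalize_histogram(dark)
```

In this toy case the three occupied gray levels are stretched over the full 0 to 255 range, which illustrates why the method raises contrast but ignores how the illumination is distributed spatially.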

Reference based methods [8, 9, 10] usually combine images of a scene taken at different times by image fusion. These methods often produce unnatural effects in the enhanced images. Besides, they tend to degrade the signal-to-noise ratio, which is adverse for further video analysis and applications such as foreground detection.

Generative Adversarial Nets. As a novel way to train generative models, GANs, proposed by Goodfellow et al. [11], have received more and more attention in recent years. In [12], GANs are applied to image completion with globally and locally consistent adversarial training. [13] uses back-propagation on a pretrained image generative network for image inpainting. To transfer the original image into a cartoon style, the Domain Transfer Network (DTN) [14] is proposed. Ledig et al. [15] have proposed a super-resolution generative adversarial network (SRGAN), in which a deep residual network [16] is employed. Recently, CycleGAN [17] has been proposed to deal with unpaired image-to-image translation, and it performs well in style transfer, object transfiguration, season transfer, and photo enhancement. In our previous work [18], a GAN-based framework for nighttime image enhancement was proposed.

Background Modeling Algorithms. Broadly speaking, background modeling methods can be divided into two categories: pixel-based methods and block-based methods.

One of the most popular pixel-based methods is the Gaussian mixture model (GMM) [19, 20, 4]. It models the distribution at each pixel observed over time as a weighted sum of Gaussian distributions. Such methods generally cope well with the multi-modal nature of many practical situations. However, if high- or low-frequency changes appear in the background, the model cannot adapt in time and may even miss some information about fast-moving objects. Consequently, Elgammal et al. have developed a non-parametric background model, which estimates the probability of observing pixel intensity values based on a sample of intensity values for each pixel.

Block-based methods [22, 23] divide each frame into multiple overlapping or non-overlapping small blocks, and then model the background using the features of each block. Compared with pixel-based methods, the image blocks capture more spatial distribution information, which makes block-based methods insensitive to local shifts in the background. However, the detection performance depends largely on the block-dividing technique, especially for small moving targets.

Figure 2: The network architecture of N2DGAN. The input nighttime image is divided into blocks, and each block serves as the input of one local generator. The outputs of the global generator and the local generators are then concatenated. Finally, after two convolutional layers, the output is the virtual daytime image corresponding to the reference daytime image. More details about our model architecture are provided in the supplemental materials.

III Foreground Object Preserving Virtual Daytime Scene Generation

To maintain spatial and temporal consistency in the generation process, our goal is to train a generation model as in Fig.1 to transfer each frame of a nighttime video to a virtual daytime image showing the same scene as the unique reference image except for the foreground region. Specifically, this generation problem is formulated in Eq.(1) as minimizing, over the parameters of the generation function, the loss averaged over all training pairs, where the loss is a weighted combination of several loss components.
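As a reading aid, the objective described above can be written in the following form. This is only a sketch under assumed notation: the symbols $G_{\theta}$ (generation function with parameters $\theta$), $I^{\mathrm{night}}_{i}$ (the $i$-th nighttime training frame), $I^{\mathrm{ref}}$ (the daytime reference image), $N$ (number of training pairs) and $\mathcal{L}$ (the weighted loss) are chosen here for illustration and may differ from the original typesetting of Eq.(1).

```latex
\hat{\theta} \;=\; \arg\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N}
  \mathcal{L}\!\left( G_{\theta}\big(I^{\mathrm{night}}_{i}\big),\; I^{\mathrm{ref}} \right)
```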

To learn the generation function, motivated by the powerful generative ability of GANs [11], we propose a novel GAN-based two-pathway network, N2DGAN, with a generator network and a discriminator network. The objective in Eq.(1) thus becomes the min-max problem of Eq.(2), in which the generator and the discriminator are optimized against each other over the real daytime image distribution and the nighttime image distribution. Here we adopt the same formulation for Eq.(2) as in WGAN [24]: the discriminator is restricted to the set of 1-Lipschitz functions, and weight clipping is used to enforce the Lipschitz constraint. We optimize the min-max problem by alternately training the generator and the discriminator.
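The two WGAN ingredients mentioned here, the critic objective and weight clipping, can be sketched in a few lines. This is a minimal illustration in plain Python rather than the training code of N2DGAN: the function names are hypothetical, critic scores are passed in as plain lists, and the clipping bound `c=0.01` is the default suggested in the WGAN paper [24], not a value reported for N2DGAN.

```python
def critic_objective(scores_real, scores_fake):
    """Wasserstein estimate maximized by the critic: E[D(day)] - E[D(G(night))].

    The generator is trained to minimize the same quantity.
    """
    mean_real = sum(scores_real) / len(scores_real)
    mean_fake = sum(scores_fake) / len(scores_fake)
    return mean_real - mean_fake


def clip_weights(weights, c=0.01):
    """Clamp every critic weight into [-c, c] after a gradient update.

    This is the crude way WGAN keeps the critic approximately Lipschitz.
    The bound c is an assumed value, see the lead-in.
    """
    return [max(-c, min(c, w)) for w in weights]


# Toy usage with hypothetical critic scores and weights.
gap = critic_objective([1.0, 3.0], [0.0, 1.0])
clipped = clip_weights([0.5, -0.2, 0.005])
```

In an alternating scheme, the critic would take one or more steps that increase `critic_objective`, each followed by `clip_weights`, before the generator takes a step that decreases it.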

III-A Architecture

An overview of the network architecture of the proposed N2DGAN is shown in Fig.2. To balance the preservation of the background scene and the foreground object(s) when generating the virtual daytime image, a two-pathway generator is proposed, with a global sub-network for maintaining the background scene and local sub-networks for capturing local foreground information. As illustrated in Fig.2, both the global and local sub-networks are designed in an encoder-decoder manner with modules of the form Convolution-BatchNorm-ReLU, and each layer is followed by three residual blocks [16]. Following the architecture of "U-Net" [17], a fusion subnet is also designed to connect the encoder and the decoder, since symmetric layers can share some common information. This helps the information about the foreground object flow between the input and the output of the network. The details of the model architecture are provided in the supplemental materials.

Figure 3: Illustration of the proposed multi-scale N2DGAN for foreground detection. In the training phase, that is, the background modeling phase, we extend N2DGAN to generate multi-scale daytime images for each input frame, and then model the temporal distribution of each pixel of the generated image at each scale. A pre-specified daytime reference image and one discriminator network per scale are used. In the testing phase, that is, the foreground detection phase, the pre-trained N2DGAN network outputs the multi-scale daytime images corresponding to the input frame. The probabilities of each pixel belonging to the background at each scale are then integrated by Bayes inference to detect the foreground object.

III-B Loss Function

The loss function in Eq.(1) plays a significant role in training the GAN model. For an input nighttime image, several kinds of loss functions are used so that the generated virtual daytime image retains most of the image information, such as structure, objects, and texture, of the pre-specified reference image.

Adversarial loss. To encourage the generated images to move towards the real daytime image manifold and to contain more details, the adversarial loss (Eq.(3)) is first considered for distinguishing the generated image from the daytime image. It is computed over all nighttime images in the training dataset.

Perceptual loss. To minimize the high-level perceptual and semantic differences between the generated image and the reference image while preventing unexpected overfitting, we follow the idea of minimizing the difference between two images in the convolutional layers of a pre-trained network [25]. The motivation is that a neural network pre-trained on an image classification task has already learnt effective representations, which can be transferred to other tasks such as our enhancement processing. Specifically, the perceptual loss (Eq.(4)) is defined on the activations of the convolutional layers of the pre-trained network, where the sum runs over the convolutional layers used and the feature-map dimensions are those of the respective layers within the VGG network.

Pixel-wise loss. To facilitate the subsequent background modeling task with pixel-wise spatial and temporal consistency constraints, the widely used pixel-wise MSE loss (Eq.(5)) and total variation loss (Eq.(6)) are also adopted.
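For readers who want the two pixel-wise terms in concrete form, the sketch below computes them on small grayscale images stored as nested lists. It is an illustration under stated assumptions: the function names are hypothetical, and the total variation term uses the common anisotropic (absolute-difference) variant, which may differ from the exact definition in Eq.(6).

```python
def mse_loss(generated, reference):
    """Mean squared error between two equally sized 2-D images."""
    height, width = len(generated), len(generated[0])
    total = sum(
        (generated[y][x] - reference[y][x]) ** 2
        for y in range(height)
        for x in range(width)
    )
    return total / (height * width)


def total_variation_loss(image):
    """Anisotropic total variation: sum of absolute neighbor differences.

    Small values mean a spatially smooth image; noise raises the value.
    """
    height, width = len(image), len(image[0])
    horizontal = sum(
        abs(image[y][x + 1] - image[y][x])
        for y in range(height)
        for x in range(width - 1)
    )
    vertical = sum(
        abs(image[y + 1][x] - image[y][x])
        for y in range(height - 1)
        for x in range(width)
    )
    return horizontal + vertical


# Toy usage with hypothetical 2x2 images.
mse = mse_loss([[1, 2], [3, 4]], [[1, 2], [3, 6]])
tv = total_variation_loss([[0, 1], [2, 3]])
```

The MSE term ties every generated pixel to the reference image, which is what enforces temporal consistency across frames, while the total variation term discourages isolated noisy pixels.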


Since each of the loss functions mentioned above offers a unique view on the visual quality of the generated virtual image, an intuitive approach is to combine them. The final overall loss function (Eq.(7)) is therefore a weighted sum of the adversarial, perceptual, MSE, and total variation losses, with one weight per term.
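Written out, a weighted combination of the four terms has the form below. The symbols $\lambda_{\mathrm{adv}}$, $\lambda_{\mathrm{per}}$, $\lambda_{\mathrm{mse}}$, $\lambda_{\mathrm{tv}}$ and the subscripted loss names are assumed notation introduced here for illustration; they stand in for the four weights and loss terms of Eq.(7).

```latex
\mathcal{L} \;=\; \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}}
  \;+\; \lambda_{\mathrm{per}}\,\mathcal{L}_{\mathrm{per}}
  \;+\; \lambda_{\mathrm{mse}}\,\mathcal{L}_{\mathrm{mse}}
  \;+\; \lambda_{\mathrm{tv}}\,\mathcal{L}_{\mathrm{tv}}
```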

IV Multi-scale Generation for Background Modeling


N2DGAN ensures that there is a detectable difference between the foreground object and the background. These characteristics match the major premise of GMM, namely that the background is visible more frequently than the foreground and that its variance is small. However, a neural network is subject to both randomness and uncertainty. Thus, there inevitably exists a pixel-wise difference between the generated virtual daytime image and the reference image, and it can essentially be regarded as a kind of random noise arising from both the spatial and temporal domains. In other words, given the total error value, the difference at one pixel may occur at any other pixel with equal probability. This inevitably has a negative influence on pixel-level background modeling.

To mitigate this issue, inspired by work on multi-scale products for edge detection [33, 34], which tends to yield well-localized detections, we extend N2DGAN to a multi-scale generative model, as shown in Fig.3, to make the background modeling more robust to noise.

IV-A Formulation

As illustrated in Fig.3, we reformulate the generation problem for background modeling as follows: train a parametrized generator network that learns a mapping from the source domain of nighttime images to the target domain of daytime images. For every input nighttime frame, generating a multi-scale set of images in the daytime domain is formulated in Eq.(8) as minimizing the loss function described above over the training pairs. In addition, we introduce adversarial discriminators, one per scale, to distinguish the reference daytime image from the generated virtual daytime image at that scale. In particular, similar to Eq.(2), the generation task at each scale can be reformulated as a min-max problem (Eq.(9)).


It should be noted that both the generative network architecture and the discriminator architecture in Fig.3 are the same as those in Fig.2. Unlike the single-scale N2DGAN, however, more convolutional layers are employed to generate the multi-scale daytime images.

IV-B Multi-scale Foreground Detection

N2DGAN enforces the sequence of generated virtual daytime images to be as close as possible to the pre-specified unique reference frame. Inevitably, however, some difference remains between them: one part is noise introduced by the neural network, and the other is the foreground region. To effectively suppress the noise coming from virtual daytime generation while strengthening the discrimination of the foreground object, a multi-scale Bayes model is proposed to characterize the temporal variation of the background.

For each pixel, the background model denotes the probability of that pixel being background. Given the multi-scale representations of the pixel, the background model is given by the Bayes criterion in Eq.(10), in which ⌊·⌋ represents rounding down to the nearest whole number. Under the assumption that the virtual daytime images at different scales are generated independently of each other, the background model given by Eq.(10) further reduces to Eq.(11). As Eq.(11) shows, the background model is equivalent to the multi-scale product of the background models at the individual scales. In addition, for the background model at each scale, a single Gaussian model can simply be applied instead of a GMM, as shown in Fig.3.
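The multi-scale product can be illustrated for a single pixel as follows. This is a sketch under several assumptions that are not stated in the text: each scale's model is a single Gaussian fitted to that pixel's values in the training frames, the per-scale score is the unnormalized Gaussian likelihood (so it equals 1 at the mean), and a pixel is declared foreground when the product falls below a hypothetical threshold of 0.05. All function names are illustrative.

```python
import math


def fit_gaussian(samples):
    """Fit a single Gaussian (mean, variance) to one pixel's training values."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, max(var, 1e-6)  # floor the variance to avoid division by zero


def gaussian_score(value, mean, var):
    """Unnormalized Gaussian likelihood in (0, 1]; equals 1 at the mean."""
    return math.exp(-0.5 * (value - mean) ** 2 / var)


def multiscale_background_score(values, models):
    """Product of per-scale background scores (scales assumed independent)."""
    score = 1.0
    for value, (mean, var) in zip(values, models):
        score *= gaussian_score(value, mean, var)
    return score


def is_foreground(values, models, threshold=0.05):
    """Flag the pixel as foreground when the multi-scale product is small."""
    return multiscale_background_score(values, models) < threshold


# Toy usage: one pixel observed at two scales over four training frames.
models = [fit_gaussian([100, 102, 98, 100]), fit_gaussian([101, 99, 100, 100])]
background_like = is_foreground([100, 100], models)
object_like = is_foreground([140, 138], models)
```

Because the scores are multiplied, a deviation has to be supported at every scale to survive, whereas noise that appears at only one scale is damped by the other factors close to 1. This is the intuition behind using the product to suppress generation noise.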

V Experimental Results

We evaluate the proposed background modeling approach visually and quantitatively, by comparing it with state-of-the-art methods and providing extensive ablation studies.

V-A Datasets and Experiment Settings

Datasets and evaluation metrics. Our work mainly focuses on background modeling under nighttime scenes with low illumination. However, to the best of our knowledge, there are no public datasets for evaluating such a task. For this reason, we collect several benchmark datasets with a Canon IXY 210F video camera, covering indoor and outdoor scenes with manually labeled ground truth. The details of our four datasets, Lab, Tree, Lake1, and Lake2, are shown in Tab.I (the datasets and code of our method will be released). For each dataset, the corresponding pre-specified daytime images that serve as ground-truth background frames are also provided. It is worth noting that both the 'Lake' and 'Tree' datasets are captured outdoors on windy nights, while the 'Lab' dataset is taken indoors, where we deliberately control the lighting intensity by pulling curtains and switching incandescent lights. These datasets are very challenging for the background modeling task since they feature undulation of the lake, reflection of lights in the water, shaking leaves, and illumination variation. To enable a quantitative evaluation, the foreground object(s) in the datasets are also manually labeled. Following previous works on foreground detection, IoU is employed as our evaluation metric.
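For reference, IoU (intersection over union) between a predicted foreground mask and the labeled ground truth can be computed as in the following sketch. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration.

```python
def iou(predicted, truth):
    """Intersection over union of two binary masks given as 2-D lists."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(predicted, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


# Toy usage with hypothetical 2x3 masks.
predicted_mask = [[1, 1, 0], [0, 1, 0]]
truth_mask = [[1, 0, 0], [0, 1, 1]]
score = iou(predicted_mask, truth_mask)
```

Here two pixels are foreground in both masks and four are foreground in at least one, so the score is 0.5.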


Implementation details. The proposed N2DGAN model is trained on NVIDIA TITAN Xp GPUs. The first frames of each nighttime video in Tab.I, paired with the corresponding daytime reference image, are used to train the model, and the remaining frames are used for testing. All images are downscaled to a common resolution. Each image is split into multiple image blocks of equal size, and each block is used as the input of one local generator subnet. For computational efficiency, only two scales are adopted to eliminate the influence of noise and the spatial shift of background pixels. The model is optimized by mini-batch gradient descent based on RMSProp. Since WGAN [24] is used as the backbone of our generation model, the weights need to be clamped to a fixed box after each gradient update to avoid gradient vanishing and mode collapse during learning. In all our experiments, we empirically set the weights in Eq.(7) so that the loss terms maintain the same order of magnitude. Besides, the first convolutional layers of the VGG network are used to calculate the perceptual loss.

Datasets Lab Tree Lake1 Lake2
#Frames 693 722 579 1482
Table I: The details of our four datasets.
Figure 4: Qualitative comparison of different foreground detection methods. (a) nighttime image, (b) groundtruth, (c) our method N2DGAN, (d) GMG [31], (e) ASOM [28], (f) FASOM [27], (g) LOBSTER [26], (h) GMM [4], (i) MCueBGS [40], (j) SUBSENSE [29], (k) MRF-UV [32], (l) HE [6]+GMM [4], (m) MSR [5]+GMM [4], (n) HE [6]+SUBSENSE [29], (o) MSR [5]+SUBSENSE [29].

V-B Performance Evaluation

Comparisons with state-of-the-art methods. We compare the proposed method with state-of-the-art foreground detection methods, including eight typical background modeling methods and four enhancement-based methods. Implementations of all these methods are based on the BGSLibrary [39] with default parameters. The quantitative comparison results on the sequences are shown in Tab.II. Our method achieves the best performance on all of them. Specifically, for the sequences 'Tree2' and 'Lake2', our method outperforms the state-of-the-art method SUBSENSE [29] by 75% and 44%, respectively. Fig.4 presents the qualitative comparison results, which show that the proposed N2DGAN performs better. This mainly comes from two facts: (1) Compared with direct background modeling, generating daytime images makes the flat pixel distribution sharper, so the foreground is easier to detect. (2) Compared with enhancement-based methods, the unique daytime reference frame in the generative process ensures the inter-frame consistency of the generated daytime images.

Methods Accuracy()
Tree Lake1 Lake2
Typical methods GMG[31]
Ours N2DGAN(global)
Table II: Quantitative comparison of foreground detection accuracy by different methods.
Figure 5: Sequence stability comparison with HE, MSR and N2DGAN.

Stability comparison. The stability of successive frames is of great importance for subsequent background modeling. For each frame, we measure the stability between it and its adjacent frame by the distance between the two frames. Here, we adopt the Kullback-Leibler divergence as the distance, which indicates stronger stability when its value is close to 0. Fig.5 shows the stability comparison of several representative methods on a randomly selected sequence of 200 consecutive frames from the three test sets. The result of our method (yellow line) is quite close to that of the real nighttime images (blue line), while both HE (red line) and MSR (green line) differ distinctly from the real sequence. In particular, the large fluctuation of MSR indicates that the pixel values of two adjacent frames differ greatly, which is unfavorable for background modeling. To sum up, with the joint constraint of spatial and temporal consistency, our generation-based method performs better than the enhancement-based methods.
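One plausible way to compute such a stability score is sketched below. It is an assumption-laden illustration, not the exact metric of the paper: each frame is summarized by a smoothed gray-level histogram, and the Kullback-Leibler divergence is taken between the histograms of two adjacent frames. The bin count, the smoothing constant and all function names are hypothetical choices.

```python
import math


def gray_histogram(frame, bins=16, levels=256, eps=1e-6):
    """Normalized gray-level histogram with additive smoothing."""
    counts = [0] * bins
    n_pixels = 0
    for row in frame:
        for p in row:
            counts[p * bins // levels] += 1
            n_pixels += 1
    # Smoothing keeps every bin strictly positive so the logarithm is defined.
    total = n_pixels + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) of two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def stability(frame, next_frame):
    """KL divergence between adjacent frames; values near 0 mean stable."""
    return kl_divergence(gray_histogram(frame), gray_histogram(next_frame))


# Toy usage with hypothetical 2x2 frames.
frame_a = [[10, 20], [30, 40]]
frame_b = [[200, 210], [220, 230]]
same = stability(frame_a, frame_a)
different = stability(frame_a, frame_b)
```

Identical frames give a score of exactly 0, and the score grows as the intensity distributions of adjacent frames drift apart, which matches the reading of Fig.5 given above.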

Figure 6: Foreground detection accuracy on Lake1 dataset with different levels of noises.

V-C Robustness to Noise and Illumination Variation

Rain, snow and fog usually bring great challenges to background modeling. To demonstrate N2DGAN's robustness under such conditions, additive Gaussian noise is added to the nighttime video sequence to simulate extreme weather. We randomize the noise standard deviation separately for each testing example. As illustrated in Fig.6, the methods behave quantitatively differently on all three datasets, and our method is the only one that performs well across the different noise levels. On two randomly selected frames, Lake1-13th and Lake1-158th, Fig. 7 and Fig. 8 visually compare the results with conventional background modeling methods and enhancement-based methods, and clearly show that the proposed N2DGAN achieves the best performance. Meanwhile, a demo (noise.avi) in the attachment shows the performance of N2DGAN under different noise levels on the whole Lake1 dataset. This video illustrates that our method performs steadily under noisy conditions.
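The noise protocol described above can be reproduced along the following lines. This is a minimal sketch: the function names are hypothetical, and the range from which the standard deviation is drawn, `(0.0, 25.0)` on an 8-bit scale, is an assumed placeholder because the range used in the experiments is not given here.

```python
import random


def add_gaussian_noise(frame, sigma, rng=None):
    """Add zero-mean Gaussian noise to an 8-bit grayscale frame (2-D list)."""
    if rng is None:
        rng = random.Random()
    noisy = []
    for row in frame:
        # Round to integers and clamp to the valid 8-bit range.
        noisy.append(
            [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        )
    return noisy


def noisy_sequence(frames, sigma_range=(0.0, 25.0), seed=0):
    """Draw one noise level per frame, then corrupt that frame with it.

    sigma_range is an assumed placeholder, see the lead-in.
    """
    rng = random.Random(seed)
    corrupted = []
    for frame in frames:
        sigma = rng.uniform(*sigma_range)
        corrupted.append(add_gaussian_noise(frame, sigma, rng))
    return corrupted


# Toy usage with one hypothetical 2x3 frame.
frames = [[[0, 128, 255], [10, 20, 30]]]
corrupted = noisy_sequence(frames, seed=1)
```

Fixing the seed makes the corrupted sequence reproducible, so that every compared method sees exactly the same noisy frames.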

To evaluate the robustness to illumination variation, Fig. 9 illustrates the comparison results on the indoor nighttime dataset Lab with illumination variation. As we can see from Fig.9, the N2DGAN model is much less sensitive to instantaneous changes in light than other state-of-the-art background modeling methods. Here, the enhancement result based on HE (Fig. 9(b)) is used only to make the foreground object visible, since it is not easy to find the ground truth in the dark. Besides, to better demonstrate the gradual change of indoor illumination, we upload two demos (illumination 1.avi and illumination 2.avi; the illumination change can be observed more clearly in illumination 1.avi in the attachment), which also indicate that illumination variation has little effect on the performance of the N2DGAN model.

Two factors account for our high resilience to noise and illumination change. The first is our model design, which allows noisy pixels to be fitted to the reference background image. The second is our background model on successive multi-scale images, which is more robust to noise thanks to hierarchical Bayes modeling.

Figure 7: Performance evaluation of robustness to noise on Lake1-13th. (a) The input nighttime image. (b) Groundtruth. (c) N2DGAN. (d) GMG[31], (e) IMBGS[30]. (f)ASOM [28], (g)FASOM [27], (h)LOBSTER [26], (i)GMM [4], (j)SUBSENSE [29], (k)MRF-UV [32], (l) HE [6]+GMM [4], (m) MSR [5]+GMM [4], (n) HE [6]+SUBSENSE [29], (o) MSR [5]+SUBSENSE [29].
Figure 8: Performance evaluation of robustness to noise on Lake1-158th. (a) The input nighttime image. (b) Groundtruth. (c) N2DGAN. (d) GMG[31], (e) IMBGS[30]. (f)ASOM [28], (g)FASOM [27], (h)LOBSTER [26], (i)GMM [4], (j)SUBSENSE [29], (k)MRF-UV [32], (l) HE [6]+GMM [4], (m) MSR [5]+GMM [4], (n) HE [6]+SUBSENSE [29], (o) MSR [5]+SUBSENSE [29].
Figure 9: Performance evaluation of robustness to illumination variation. (a) nighttime image, (b) HE, (c) N2DGAN, (d) GMG [31], (e) ASOM [28], (f) FASOM [27], (g) LOBSTER [26], (h) GMM [4], (i) MCueBGS [40], (j) SUBSENSE [29], (k) MRF-UV [32], (l) HE [6]+GMM [4], (m) MSR [5]+GMM [4], (n) HE [6]+SUBSENSE [29], (o) MSR [5]+SUBSENSE [29].

V-D Ablation Study

Global and Local Consistency Evaluation. To verify the effectiveness of combining both local and global consistency together in our model, we first perform foreground detection when using global subnetwork alone, called N2DGAN(global). As observed from Fig.10, small targets in nighttime images are lost in this case. Meanwhile, when using local subnet alone, called N2DGAN(local), the detected foreground objects are incomplete on the edge of patches, since there exists blocking-artifact problem caused by patch enhancement. Additionally, quantitative comparison results shown in Tab.II (bottom) demonstrate that our baseline improves detection accuracy by more than 10% and 5% compared with N2DGAN(global) and N2DGAN(local), respectively.

Figure 10: Comparison of foreground detection results using global subnet and local subnet independently. (a) Input nighttime image. (b) Ground truth. (c) N2DGAN. (d) N2DGAN(global). (e) N2DGAN(local).

Multi-scale Bayes modeling evaluation. To further demonstrate the effectiveness of our background modeling on successive multi-scale images in the daytime domain, we also perform detection on single-scale generated images. As illustrated in Fig.11, because the multi-scale Bayes model suppresses the noise caused by the network, our baseline N2DGAN makes the foreground region more salient.

Figure 11: Comparison of foreground detection results using multiple scales and a single scale. (a) Nighttime image. (b) Generated image. (c) Groundtruth. (d) N2DGAN. (e) The detection result using a single scale.

V-E Time Complexity Analysis

For a background model, computational complexity is a key issue. The computational cost of our N2DGAN model mainly consists of two parts, i.e., virtual daytime image generation and foreground detection. In the foreground detection stage, our model holds only a single Gaussian model, which can be obtained off-line and needs no online updating; the time cost of this stage is therefore much lower than that of the traditional GMM and is negligible compared with that of the generation stage.

By comparison, generating a virtual daytime image occupies most of the time, with a frame rate of 8 fps (Table III) without any code optimisation. As shown in Table III, the frame rate of N2DGAN(local) for generating the blocks of local sub-images is around 10 fps, which is much slower than N2DGAN(global) with a frame rate of 56 fps. (An input nighttime image is divided into blocks, and the corresponding local generation sub-network is trained on each block to generate a local sub-image; for details, please refer to Sect.3 of the submitted manuscript.) However, since each block of the local virtual sub-image can be generated in parallel, the frame rate of N2DGAN(local) would increase dramatically in an ideal situation. The global generation process at 56 fps would then dominate the overall computational cost of virtual daytime image generation. In this way, the need for online real-time foreground object detection can be met.

Time Complexity (fps)
N2DGAN N2DGAN(global) N2DGAN(local)
8 56 10
Table III: Time complexity analysis of generation process.
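The figures in Table III can be cross-checked with a short calculation. The sketch below assumes that the global and local pathways run one after the other, so their per-frame times add, and that the cost of the fusion layers is negligible; both are assumptions made here for illustration. The speed-up factor of 4 in the second call is purely hypothetical and only shows how an ideal parallel speed-up of the local pathway would change the total.

```python
def combined_fps(*stage_fps):
    """Frame rate of stages executed sequentially: per-frame times add up."""
    return 1.0 / sum(1.0 / fps for fps in stage_fps)


# Global (56 fps) followed by local (10 fps), as reported in Table III.
sequential = combined_fps(56, 10)  # about 8.5 fps

# Hypothetical: local pathway sped up 4x by running blocks in parallel.
parallel_example = combined_fps(56, 10 * 4)
```

Under these assumptions the sequential estimate of about 8.5 fps is consistent with the 8 fps reported for the full N2DGAN, and the total approaches the 56 fps of the global pathway as the local pathway is parallelized further.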

VI Conclusion

For the challenge of background modeling under nighttime scenes, an innovative N2DGAN model is proposed, which paves a new way completely different from the existing methods. To the best of our knowledge, this is the first time GAN-based deep learning has been introduced for this practical problem. As an unsupervised model, N2DGAN offers good scalability and practical significance. As for time complexity, N2DGAN runs at about 8 fps (Table III). Since each local generation model can be run in parallel, the proposed N2DGAN is highly parallelizable. Besides, model compression techniques [41, 42] are also feasible solutions for network acceleration.


The authors would like to thank the anonymous reviewers for their constructive comments and valuable suggestions on this paper.


The detailed structures of the global sub-network and local sub-network are provided in Table IV and Table V, respectively. Each convolution layer is followed by residual blocks [16]. Then we simply concatenate the output from each local generator and the global generator to produce a fused feature tensor and then feed it to the successive convolution layers to generate the final output.

Layer Inputs


cov1 cov0
cov2 cov1
decov0 cov2
decov1 cov0
decov2 cov2
Table IV: Architecture of the local generator sub-network.
Layer Inputs Kernel/Stride Outputs
cov1 cov0
cov2 cov1
cov3 cov2
cov4 cov3
cov5 cov4
fc1 cov5 -
fc2 fc1 -
decov0 fc2
decov1 decov0
decov2 decov1
decov3 decov2
decov4 decov3
decov5 cov5
Table V: Architecture of the global generator sub-network.


  • [1] O. Barnich and M. Van Droogenbroeck, ViBe: a universal background subtraction algorithm for video sequences, IEEE Transactions on Image Processing, vol. 20, no. 6, pp. 1709-1724, 2011
  • [2] A. Elgammal, R. Duraiswami, D. Harwood, and L. S. Davis, Background and foreground modelling using nonparametric kernel density estimation for visual surveillance, Proceedings of the IEEE, vol. 90, no. 7, pp. 1151-1163, 2002
  • [3] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis, Real-time foreground-background segmentation using codebook model, Real-Time Imaging, vol. 11, no. 3, pp. 172-185, 2005
  • [4] Z. Zivkovic, Improved adaptive Gaussian mixture model for background subtraction, International Conference on Pattern Recognition, 2004
  • [5] D. J. Jobson, Z. Rahman, and G. A. Woodell, A multiscale retinex for bridging the gap between color images and the human observation of scenes, IEEE Transactions on Image Processing, vol. 6, no. 7, pp. 965-976, 1997
  • [6] D. J. Ketcham, Real-time image enhancement techniques, Osa Image Processing, vol. 74, no. 2, pp. 120-125, 1976
  • [7] G. Yan, Y. Lee, and T. Q. Nguyen, Nighttime image enhancement applying dark channel prior to raw data from camera, Soc Design Conference, 2016
  • [8] Y. Cai, K. Huang, T. Tan, and Y. Wang, Context enhancement of nighttime surveillance by image fusion, International Conference on Pattern Recognition, 2006
  • [9] W. Liang, K. Murari, Y. Y. Zhang, Y. Chen, X. D. Li, and M. J. Li, Image-based fusion for video enhancement of night-time surveillance, Optical Engineering, vol. 49, no. 12, pp. 120501-120501-3, 2012
  • [10] R. Raskar, A. Ilie, and J. Yu, Image fusion for context enhancement and video surrealism, International Symposium on Non-Photorealistic Animation and Rendering, 2004
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, Advances in Neural Information Processing Systems, 2014.
  • [12] S. Iizuka, E. Simo-Serra, and H. Ishikawa, Globally and locally consistent image completion, ACM Trans. Graph., vol. 36, no. 4, pp. 1-14, 2017
  • [13] R. Yeh, C. Chen, T. Y. Lim, M. Hasegawa-Johnson, and M. N. Do, Semantic image inpainting with perceptual and contextual losses, IEEE Conference on Computer Vision and Pattern Recognition, 2016
  • [14] Y. Taigman, A. Polyak, and L. Wolf, Unsupervised cross-domain image generation, International Conference on Learning Representations, 2016
  • [15] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, and Z. Wang, Photo-realistic single image super-resolution using a generative adversarial network, IEEE Conference on Computer Vision and Pattern Recognition, 2017
  • [16] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition, 2016
  • [17] J.Y. Zhu, T. Park, P. Isola, and A. A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, IEEE International Conference on Computer Vision, 2017
  • [18] Y. Meng, D. Kong, Z. Zhu, and Y. Zhao, From night to day: GANs based low quality image enhancement, Neural Processing Letters, vol. 50, no. 1, pp. 799-814, 2019
  • [19] P. Kaewtrakulpong and R. Bowden, An improved adaptive background mixture model for real-time tracking with shadow detection, Video-Based Surveillance Systems, Springer, Berlin, pp. 135-144, 2002
  • [20] C. Stauffer and W. E. L. Grimson, Adaptive background mixture models for real-time tracking, IEEE Conference on Computer Vision and Pattern Recognition, 1999
  • [21] A. M. Elgammal, D. Harwood, and L. S. Davis, Non-parametric model for background subtraction, European Conference on Computer Vision, 2000
  • [22] M. Heikkilä and M. Pietikäinen, A texture-based method for modeling the background and detecting moving objects, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 657-662, 2006
  • [23] S. Liao, G. Zhao, V. Kellokumpu, M. Pietikäinen, and S. Z. Li, Modeling pixel process with scale invariant local patterns for background subtraction in complex scenes, IEEE Conference on Computer Vision and Pattern Recognition, 2010
  • [24] M. Arjovsky, S. Chintala, and L. Bottou, Wasserstein generative adversarial networks, International Conference on Machine Learning, 2017
  • [25] L. A. Gatys, A. S. Ecker, and M. Bethge, Image style transfer using convolutional neural networks, IEEE Conference on Computer Vision and Pattern Recognition, 2016
  • [26] P.L. St-Charles and G. A. Bilodeau, Improving background subtraction using local binary similarity patterns, IEEE Winter Conference on Applications of Computer Vision (WACV), 2014
  • [27] L. Maddalena and A. Petrosino, A fuzzy spatial coherence-based approach to background/foreground separation for moving object detection, Neural Computing and Applications, vol. 19, no. 2, pp. 179–186, 2010
  • [28] L. Maddalena, A. Petrosino, et al., A self-organizing approach to background subtraction for visual surveillance applications, IEEE Transactions on Image Processing, vol. 17, no. 7, pp. 1168-1177, 2008
  • [29] P.-L. St-Charles, G.-A. Bilodeau, and R. Bergevin, Flexible background subtraction with self-balanced local sensitivity, IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014
  • [30] D. Bloisi and L. Iocchi, Independent multimodal background subtraction, Computational Modelling of Objects Represented in Images: Fundamentals, Methods and Applications III, pp. 39-44, 2012
  • [31] A. B. Godbehere, A. Matsukawa, and K. Goldberg, Visual tracking of human visitors under variable-lighting conditions for a responsive audio art installation, American Control Conference (ACC), 2012
  • [32] Z. Zhao, T. Bouwmans, X. Zhang, and Y. Fang, A fuzzy background modeling approach for motion detection in dynamic backgrounds, Multimedia and Signal Processing, 2012
  • [33] A. Rosenfeld, Y. H. Lee, and R. B. Thomas, Edge and curve detection for texture discrimination, Picture Processing and Psychopictorics, pp. 381–393, 1970
  • [34] A. Rosenfeld, Y. H. Lee, and R. B. Thomas, Edge and curve detection for visual scene analysis, IEEE Transactions on Computers, vol. 20, no. 5, pp. 562-569, 1971
  • [35] Z. Liu, K. Huang, T. Tan, et al., Foreground object detection using top-down information based on EM framework, IEEE Transactions on Image Processing, vol. 21, no. 9, pp. 4204-4217, 2012
  • [36] L. A. Lim and H. Y. Keles, Foreground segmentation using a triplet convolutional neural network for multiscale feature encoding, CoRR, abs/1801.02225, 2018
  • [37] L. Yang, J. Li, Y. Luo, Y. Zhao, and H. Cheng, Deep background modeling using fully convolutional network, IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 1-9, 2017
  • [38] D. Zeng and M. Zhu, Background subtraction using multiscale fully convolutional network, IEEE Access, vol. 6, pp. 16010-16021, 2018
  • [39] A. Sobral, BGSLibrary: An OpenCV C++ background subtraction library, IX Workshop de Visão Computacional (WVC’2013), Rio de Janeiro, Brazil, 2013
  • [40] S. J. Noh and M. Jeon, A new framework for background subtraction using multiple cues, Asian Conference on Computer Vision, 2012
  • [41] K. Jia, D. Tao, S. Gao, and X. Xu, Improving training of deep neural networks via singular value bounding, IEEE Conference on Computer Vision and Pattern Recognition, 2017
  • [42] X. Yu, T. Liu, X. Wang, and D. Tao, On compressing deep models by low rank and sparse decomposition, IEEE Conference on Computer Vision and Pattern Recognition, 2017