Resources of semantic segmentation based on Deep Learning model
In this work, we present a novel background subtraction system that uses a deep Convolutional Neural Network (CNN) to perform the segmentation. With this approach, feature engineering and parameter tuning become unnecessary, since the network parameters can be learned from data by training a single CNN that can handle various video scenes. Additionally, we propose a new approach to estimate the background model from video. For the training of the CNN, we randomly selected 5 percent of the video frames and their ground truth segmentations from the Change Detection challenge 2014 (CDnet 2014). We also utilized spatial-median filtering as post-processing of the network outputs. Our method is evaluated on different data-sets, and the network outperforms the existing algorithms with respect to the average ranking over different evaluation metrics. Furthermore, due to the network architecture, our CNN is capable of real-time processing.
With the tremendous amount of available video data, it is important to maintain the efficiency of video-based applications by processing only relevant information. Most video files contain redundant information such as background scenery, which costs a huge amount of storage and computing resources. Hence, it is necessary to extract the meaningful information, e.g. vehicles or pedestrians, to deploy those resources more efficiently.
Background subtraction is a binary classification task that assigns to each pixel in a video sequence a label for either belonging to the background or the foreground scene varadarajan2015regionmog ; st2015subsense ; barnich2011vibe .
Background subtraction, which is also called change detection, is applied to many advanced video applications as a pre-processing step to remove redundant data, for instance in tracking or automated video surveillance sajid2015colorspacessinglegaussian . In addition, for real-time applications, like tracking, the algorithm should be capable of processing the video frames in real-time.
One simple example of a background subtraction method is the pixel-wise subtraction of a video frame from its corresponding background image. Pixels whose difference exceeds a certain threshold value are labeled as foreground, all others as background. Unfortunately, this strategy yields poor segmentation due to the dynamic nature of the background, induced by noise or illumination changes. For example, due to lighting changes, even pixels belonging to the background scene can have intensities very different from their counterparts in the background image and will consequently be falsely classified as foreground. Thus, sophisticated algorithms that assure robust background subtraction under various conditions must be employed.
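The naive strategy above can be sketched in a few lines. This is a toy baseline, not the method proposed in this paper; the threshold of 30 is an arbitrary illustrative value:

```python
import numpy as np

def naive_subtraction(frame, background, threshold=30):
    """Label pixels whose absolute difference from the background image
    exceeds a fixed threshold as foreground (1), else background (0)."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    if diff.ndim == 3:          # color frames: use the maximum channel difference
        diff = diff.max(axis=2)
    return (diff > threshold).astype(np.uint8)
```

As the text notes, this baseline fails under illumination changes or dynamic backgrounds, which motivates the learned approach.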
The main difficulties that complicate the background subtraction process are:
Illumination changes: When scene lighting changes gradually (e.g. moving clouds in the sky) or instantly (e.g. when the light in a room is switched on), the background model usually has an illumination different from the current video frame and therefore yields false classifications.
Dynamic background: The background scene is rarely static due to movement in the background (e.g. waves, swaying tree leaves), especially in outdoor scenes. As a consequence, parts of the background in the video frame do not overlap with the corresponding parts in the background image, hence, the pixel-wise correspondence between image and background is no longer existent.
Camera jitter: In some cases, instead of being static, it is possible that the camera itself is frequently in movement due to physical influence. Similar to the dynamic background case, the pixel locations between the video and background frame do not overlap anymore. The difference in this case is that it also applies to non-moving background regions.
Camouflage: Most background subtraction algorithms work with pixel or color intensities. When foreground objects and background scene have a similar color, the foreground objects are more likely to be (falsely) labeled as background.
Night Videos: As most pixels have a similar color in a night scene, recognition of foreground objects and their contours is difficult, especially when color information is the only feature in use for segmentation.
Ghosts/intermittent object motion: Foreground objects that are embedded into the background scene and start moving after background initialization are the so-called ghosts. The exposed scene that was covered by the ghost should be considered as background. In contrast, foreground objects that stop moving for a certain amount of time fall into the category of intermittent object motion. Whether such an object should be labeled as foreground or background is strongly application dependent. As an example, in the automated video surveillance case, abandoned objects should be labeled as foreground.
Hard shadows: Dark, moving shadows that do not fall under the illumination change category should not be labeled as foreground.
In this work, we follow the trend of Deep Learning and apply its concepts to background subtraction by proposing a CNN to perform this task. We justify this approach with the fact that background subtraction can be performed without temporal information, given a sufficiently good background image. With such a background image, the task itself breaks down into a comparison of an image-background pair. Hence, the input samples can be independent of each other, enabling a CNN to perform the task with only spatial information. The CNN is responsible for extracting the relevant features from a given image-background pair and performs segmentation by feeding the extracted features into a classifier. In order to get a more spatially consistent segmentation, the network output is post-processed by spatial-median filtering and/or a fully connected CRF framework. Due to the use of a CNN, no parameter tuning or descriptor engineering is needed.
To train the network, a large amount of labeled data is required. Fortunately, due to the process of background subtraction, by comparing an image with its background, it is not necessary to use images of a full scene for training. It is also possible to train the network via subsets of a scene, i.e. with patches of image-background pairs, since the procedure also holds for image patches. As a consequence, we can extract enough training samples from a limited amount of labeled data.
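The patch-based training idea can be sketched as follows. The patch size of 37 is a hypothetical choice for illustration, while the stride of 10 matches the data preparation described later:

```python
import numpy as np

def extract_patch_triplets(frame, background, ground_truth, patch=37, stride=10):
    """Slide a window over aligned frame/background/ground-truth images and
    collect matching patch triplets for training."""
    triplets = []
    h, w = frame.shape[:2]
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            triplets.append((frame[y:y+patch, x:x+patch],
                             background[y:y+patch, x:x+patch],
                             ground_truth[y:y+patch, x:x+patch]))
    return triplets
```

Because the classification of a patch depends only on the patch itself and its background counterpart, a limited number of labeled frames yields many independent training samples.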
To the best of our knowledge, background subtraction algorithms that use CNNs are scene-specific to this day, i.e. a CNN can only perform satisfactory background subtraction on a single scene (the one it was trained on with scene-specific data) and also lacks the ability to perform the segmentation in real time. Our proposed approach yields a universal network that can handle various scenes without having to be retrained every time the scene changes. As a consequence, one can train the network with data from multiple scenes and hence increase the amount of training data. Also, with the proposed network architecture, it is possible to process video frames in real time with conventional computing resources. Therefore, our approach can be considered for real-time applications.
The outline of this paper is as follows: In Section 2, early and recent algorithms for background subtraction are presented. In Section 3, we explain our proposed approach: we first describe our algorithm to estimate the background image and then illustrate our CNN for background subtraction. In Section 4, we describe the experimental evaluation of the algorithm, including the used datasets, the evaluation metrics and the obtained results, followed by a detailed discussion and analysis. Finally, in Section 5, we conclude our work and outline ideas for future work.
The majority of background subtraction algorithms are composed of several processing modules which are explained in the following sections (see Figure 1).
Background Model: The background model is essential for the background subtraction algorithm. In general, the background model is used as a reference to compare with the incoming video frames. Furthermore, the initialization of the background model plays an important role since video sequences are normally not completely free of foreground objects during the bootstrapping phase. As a consequence, the model gets corrupted by including foreground objects into the background model, which leads to false classifications.
Background Model Maintenance: In reality, the background is never completely static but changes over time. There are many strategies to adapt the background model to these changes by using the latest video frames and/or previous segmentation results. A trade-off must be found for the adaptation rate, which regulates how fast the background model is updated. A high adaptation rate leads to noisy segmentation due to the sensitivity to small or temporary changes; a slow adaptation rate, however, yields an outdated background model and therefore false segmentations. Selective update adapts the background model only with pixels that were classified as background. In that case, deadlock situations can occur by not incorporating misclassified pixels into the background model, i.e. once a background pixel is falsely classified as foreground, it is never used to update the background and is always considered a foreground pixel. On the other hand, using all pixels, as in the blind update strategy, prevents such deadlocks but distorts the background model, since foreground pixels are incorporated into it.
Feature extraction: In order to compare the background image with the video frames, adequate features that represent relevant information must be selected. Most algorithms use gray scale or RGB intensities as features. In some cases, pixel intensities along with other hand engineered features (e.g. heikkila2006lbp or bilodeau2013lbsp ) are combined. Also, the choice of the feature region is important. One can extract the features over pixels, blocks or patterns. Pixel-wise features often yield noisy segmentation results since they do not encode local correlation, while block-wise or pattern-wise features tend to be insensitive to slight changes.
Segmentation: With the help of a background model, the respective video frames can be processed. Background segmentation is performed by extracting the features from corresponding pixels or pixel regions of both frames and using a distance measure, e.g. the Euclidean distance, to measure the similarity between those pixels. After being compared with the similarity threshold, each pixel is either labeled as background or foreground.
The combination of those building blocks forms an overall background subtraction system. The robustness of the system is always dependent on and limited by the performance of each individual block, i.e. it cannot be expected to perform well if one module delivers poor performance.
Background subtraction is a well studied field, and there exists a vast number of algorithms for this purpose (see Figure 2). Since most of the top-performing methods at present are based on early algorithms, some of those are outlined first. Subsequently, a few of the current methods for background subtraction are introduced.
Stauffer and Grimson stauffer1999mog
proposed a method that models the background scene with a Mixture of Gaussians (MoG), also called Gaussian Mixture Model (GMM). It is assumed that each pixel in the background is drawn from a Probability Distribution Function (PDF) which is modeled by a GMM; pixels are also assumed to be independent of their neighboring pixels. Incoming pixels from video frames are labeled as background if there exists a Gaussian in the GMM whose mean lies within a certain bound of the pixel value. For learning the parameters that maximize the likelihood, the authors proposed an online method that approximates the Expectation Maximization (EM) algorithm dempster1977em_algorithm .
A subsequent work introduced a probabilistic non-parametric method to model the background. Again, it is assumed that each background pixel is drawn from a PDF, which is estimated for each pixel with Kernel Density Estimation (KDE).
Another line of work models the background with the eigenvectors corresponding to the largest eigenvalues. Incoming images are compared with their projection onto the eigenvectors. After calculating the distances between the image and the projection and comparing them with the corresponding threshold value, foreground labels are assigned to pixels with large distances.
Other methods use a GMM to model the background scene and, in addition, employ single Gaussians for foreground modeling. By computing flux tensors bunyak2007fluxtensor , which depict variations of optical flow within a local 3D spatio-temporal volume, blob motion is detected. By combining the different information from blob motion, foreground models and background models, moving and static foreground objects can be spotted. Also, by applying edge matching evangelio2011edge_matching , static foreground objects can be classified as ghosts or intermittent motions.
A further approach clusters pixel values into groups with the K-means algorithm. For each group and each pixel location, a single Gaussian model is built by calculating the mean and variance of the cluster. For incoming pixels, the matching models are selected by taking the models that show the highest normalized cross-correlation to the pixels. The segmentation is done for each color channel of the RGB and YCbCr color representations, which yields 6 segmentation masks. By combining all available segmentation masks, the background segmentation is then performed.
Other variants use additional features alongside pixel intensities and introduce slight changes in the update heuristics of the thresholds and the background model.
In this section, we introduce our proposed method, which consists of 1) a novel algorithm for background model (image) generation; 2) a novel CNN for background subtraction; and 3) post-processing of the network's output using a median filter. The complete system is illustrated in Figure 3. We use a background image to perform background subtraction on the incoming frames. Matching pairs of image patches from the background image and the video frames are extracted and fed into a CNN. After reassembling the patches into the complete output frame, it is post-processed, yielding the final segmentation of the respective video frame.
In the following sections, we introduce our algorithm to obtain the background images from the video frames. Furthermore, we illustrate our CNN architecture and the training procedure. Finally, we discuss the employed post-processing strategies to improve the network output.
We propose a new approach to generate the background model, illustrated in Figure 5. We combine the segmentation mask of the SuBSENSE algorithm st2015subsense with the output of the Flux Tensor algorithm wang2014fluxtensormog , which allows us to dynamically change the parameters of the background model based on the motion changes in the video frames. The details of each block are explained in the following sections.
Perhaps the simplest way to obtain a background image from a video sequence is pixel-wise temporal median filtering. However, with this method the background model is quickly corrupted if there are many moving objects in the video frames, since the pixel values of foreground objects eventually degrade the quality of the background model. This requires us to distinguish foreground from background pixels and to use only background pixel values for the temporal median filtering. To this end, we use the SuBSENSE algorithm. This method relies on spatio-temporal binary features as well as color information to detect changes, which allows camouflaged foreground objects to be detected more easily while most illumination variations are ignored. Moreover, instead of using manually set, frame-wide constants to dictate model sensitivity and adaptation speed, it uses pixel-level feedback loops to dynamically adjust the method's internal parameters without user intervention. The adjustments are based on continuous monitoring of model fidelity and local segmentation noise levels. The output of the SuBSENSE algorithm is a binary image which contains the classification of the current video frame: white pixels mark foreground objects, and black pixels mark the background.
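The simple temporal-median estimator that this paragraph starts from can be sketched as:

```python
import numpy as np

def temporal_median_background(frames):
    """Pixel-wise temporal median over a stack of frames of equal shape.
    As discussed above, this degrades when foreground objects dominate
    many frames, since their values leak into the median."""
    return np.median(np.stack(frames, axis=0), axis=0).astype(np.uint8)
```

In the full system, only values classified as background by SuBSENSE enter the statistic, via the background pixel library described next.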
Based on the foreground mask from SuBSENSE, we build a background pixel library for each pixel location in the frame. The idea is to store a pixel value from the current frame in the background pixel library only when it is classified by the SuBSENSE algorithm as a background pixel. An illustration of the background pixel library is shown in Figure 7.
Here, we only keep the last 90 background pixel values from the video sequence. Once the library is full, the oldest background pixel value is replaced with the newest one. For this purpose, we maintain a pointer to the oldest background pixel value in the library. Each time a pixel value is classified by SuBSENSE as background, it is stored at the location the pointer indicates, and the pointer moves to the next position in the library. To generate the background image, we calculate, for each pixel location and color channel, the average over a certain memory length of the most recent values in the background pixel library.
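A minimal sketch of the background pixel library (grayscale, capacity 90 as in the text; the class and variable names are our own):

```python
import numpy as np

class BackgroundPixelLibrary:
    """Per-pixel circular buffer of the last `capacity` background-classified
    values. The background estimate averages the `memory_length` newest entries."""
    def __init__(self, height, width, capacity=90):
        self.lib = np.zeros((height, width, capacity), dtype=np.float32)
        self.ptr = np.zeros((height, width), dtype=np.int32)   # next write slot
        self.count = np.zeros((height, width), dtype=np.int32)
        self.capacity = capacity

    def update(self, frame, background_mask):
        # store only pixels that the mask marks as background
        ys, xs = np.nonzero(background_mask)
        self.lib[ys, xs, self.ptr[ys, xs]] = frame[ys, xs]
        self.ptr[ys, xs] = (self.ptr[ys, xs] + 1) % self.capacity
        self.count[ys, xs] = np.minimum(self.count[ys, xs] + 1, self.capacity)

    def background(self, memory_length):
        """Average the most recent `memory_length` stored values per pixel."""
        n = np.minimum(np.maximum(self.count, 1), memory_length)
        offsets = np.arange(1, self.capacity + 1)
        idx = (self.ptr[..., None] - offsets) % self.capacity   # newest first
        recent = np.take_along_axis(self.lib, idx, axis=2)
        valid = offsets[None, None, :] <= n[..., None]
        return (recent * valid).sum(axis=2) / n
```

Averaging only the newest entries is what makes the memory length adaptive: shrinking it discards stale values without clearing the buffer.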
However, if we use a fixed memory length over the entire video sequence, we obtain either a blurry or an outdated background model when the camera is moving or when there are intermittent objects in the video sequence. These two cases are illustrated in Figure 8. In the first row, the camera is constantly rotating, and the calculated background image, shown on the right hand side, becomes very blurry with a fixed memory length. In the second row, the car on the left stops at the same location for 80 percent of the frames in the video sequence, so it is naturally classified as background by SuBSENSE. When the car then starts to move, the background image calculated with a fixed memory length is not updated, because the pixel values of the car are still stored in the background library.
In order to adapt the memory length to the motion of the camera and of the objects in the video frames, we need a motion detector. A commonly used motion detector is based on the flux tensor. Compared with standard motion detection algorithms, the advantage of the flux tensor is that the motion information can be computed directly, without expensive eigenvalue decompositions. The flux tensor represents the temporal variation of the optical flow field within a local 3D spatio-temporal volume wang2014fluxtensormog ; in expanded matrix form, the flux tensor is written as
The elements of the flux tensor incorporate information about temporal gradient changes which leads to efficient discrimination between stationary and moving image features. The trace of the flux tensor matrix which can be compactly written and computed as
can be used directly to classify moving and non-moving regions without eigenvalue decompositions. In our approach, the output of the Flux Tensor algorithm is a binary image which contains the motion information of the video frames. An example is shown in Figure 9. A white pixel in the binary image indicates that the pixel at this location is moving, either temporally or spatially.
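A rough numpy approximation of the flux-tensor trace as a motion detector. Finite differences stand in for the derivative filters of the original formulation, and the threshold is an illustrative assumption:

```python
import numpy as np

def flux_trace_motion_mask(frames, threshold=1.0):
    """Approximate the flux-tensor trace: squared temporal derivatives of the
    spatial gradients (and of the intensity), averaged over the time window.
    Returns a boolean motion mask."""
    stack = np.stack(frames, axis=0).astype(np.float32)   # (T, H, W)
    gy, gx = np.gradient(stack, axis=(1, 2))              # spatial gradients
    dt_gx = np.gradient(gx, axis=0)                       # their temporal change
    dt_gy = np.gradient(gy, axis=0)
    dt_it = np.gradient(np.gradient(stack, axis=0), axis=0)
    trace = (dt_gx**2 + dt_gy**2 + dt_it**2).mean(axis=0)
    return trace > threshold
```

A static scene yields a zero trace everywhere, while moving structures produce large values without any eigenvalue decomposition, which is the efficiency argument made above.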
Next, we define a new variable as:
where the numerator is the number of white pixels in the flux tensor output and the denominator contains the width and height of the image. The variable thus indicates what percentage of the pixels in the current video frame are moving. A large value means either that the camera is moving or that a large object in the frame starts to move; in this case, we need to decrease the memory length. If the value is relatively small, the background is steady and we can use a large memory length to suppress noise. Using this quantity, we dynamically increase or decrease the memory length. The relation between the two is given as follows
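Since the paper's exact mapping is not reproduced here, the following linear interpolation only illustrates the principle; the bounds are assumptions:

```python
def adaptive_memory_length(moving_fraction, l_min=5, l_max=90):
    """Map the fraction of moving pixels (0..1) to a memory length:
    much motion -> short memory, little motion -> long memory."""
    moving_fraction = min(max(moving_fraction, 0.0), 1.0)
    return round(l_max - (l_max - l_min) * moving_fraction)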
In order to avoid noise in this quantity and the resulting noise in the memory length, a low-pass filter is applied to its value; the filter is defined as follows
Note that the filter coefficient differs depending on whether the value is increasing or decreasing. The reason is that after a dramatic decrease we want to increase the memory length more slowly, in order to let the library first be updated with new background pixel values.
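The asymmetric low-pass filtering can be sketched as follows; the two filter coefficients are illustrative assumptions, chosen so that the memory length drops quickly when motion appears but recovers slowly:

```python
def smooth_memory_length(prev, target, alpha_up=0.05, alpha_down=0.5):
    """First-order low-pass filter with direction-dependent rates:
    fast decrease (alpha_down), slow recovery (alpha_up)."""
    alpha = alpha_up if target > prev else alpha_down
    return prev + alpha * (target - prev)
```

Calling this once per frame with the previous smoothed value and the new target implements the asymmetric behavior described above.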
Due to the low quality of some surveillance cameras, there are undesired pixel values around moving objects in the form of semi-transparency and motion blur, as illustrated in Figure 10.
In these two cases, the pixels near the SuBSENSE segmentation mask will probably be false negatives: even though they are classified as background, they are actually corrupted by foreground pixels through semi-transparency and motion blur. Such pixels should not be added to the background library. For this purpose, we perform a padding around the foreground mask whose size is defined by a padding variable: if a pixel is classified as background but pixels within the padding radius around it are classified as foreground, this background pixel is disregarded. The padding size is also dynamically adjusted using the output of the flux tensor. For instance, if the fraction of moving pixels is large, the padding is increased, because with more moving pixels the phenomena of semi-transparency and motion blur also increase. Figure 11 shows the comparison between the background model from SuBSENSE and the robust background model obtained by the proposed approach.
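The padding around the foreground mask amounts to a morphological dilation. A pure-numpy sketch with an assumed fixed pad (in the full system, the pad would be driven by the flux-tensor output):

```python
import numpy as np

def dilate_foreground(mask, pad=2):
    """Grow the boolean foreground mask by `pad` pixels (square structuring
    element) so that background pixels bordering moving objects are excluded
    from the background library."""
    h, w = mask.shape
    out = np.zeros_like(mask, dtype=bool)
    for dy in range(-pad, pad + 1):
        for dx in range(-pad, pad + 1):
            shifted = np.zeros_like(mask, dtype=bool)
            ys = slice(max(dy, 0), h + min(dy, 0))
            xs = slice(max(dx, 0), w + min(dx, 0))
            ys_src = slice(max(-dy, 0), h + min(-dy, 0))
            xs_src = slice(max(-dx, 0), w + min(-dx, 0))
            shifted[ys, xs] = mask[ys_src, xs_src]
            out |= shifted
    return out
```

A pixel is then stored in the library only if the dilated mask marks it as background.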
We train the proposed CNN with background images obtained by the SuBSENSE algorithm st2015subsense . Both networks are trained with pairs of RGB image patches from video and background frames and the respective ground truth segmentation patches. Before introducing the network architecture, we outline the data preparation step for the network training. Afterwards, we illustrate the architecture of our CNN and explain the training procedure in detail.
For the training of the CNN, we use random data samples (around 5 percent) from the CDnet 2014 data-set wang2014cdnet , which contains various challenging video scenes and their ground truth segmentation. We prepare one set of background images that is obtained by the proposed algorithm. Since we want the network to learn representative features, we only utilize video scenes that do not challenge the background model, i.e. the background scene should not change significantly in a video. Therefore, we exclude all videos from certain categories for training (see Table 1).
We work with RGB images for background and video frames. Before we extract the patches, all employed frames are resized to the dimension and the RGB intensities are rescaled to . Furthermore, zero padding is applied before patch extraction to avoid boundary effects. The training data consist of triplets of matching patches from video, ground truth and background frames of size
that are extracted with a stride of 10 from the employed training frames. An example mini-batch of training patches is shown in Figure.12.
As widely recommended, we perform mean subtraction on the image patches before training, but we discard the division by the standard deviation, since we are working with RGB data and therefore each channel has the same scale.
[Table 1: CDnet 2014 wang2014cdnet categories]
The architecture of the proposed CNN is illustrated in Figure 13. The network contains 3 convolutional layers and a 2-layer Multi-Layer Perceptron (MLP). We use the Rectified Linear Unit (ReLU) as activation function after each convolutional layer and the Sigmoid function after the last fully connected layer. In addition, we insert batch normalization layers ioffe2015batchnormalization before each activation layer. A batch normalization layer stores the running average of the mean and standard deviation of its inputs; the stored mean is subtracted from each input of the layer, and division by the standard deviation is performed. It has been shown that applying batch normalization layers decreases over-fitting and allows higher learning rates during training.
We train the networks with mini-batches of size 150 via RMSProp hinton2012lecturermsprop and a learning rate of . For the loss function, we choose the Binary Cross Entropy (BCE), which is defined as follows:
The BCE is calculated between the network outputs and the corresponding vectorized ground truth patches of size . Boundaries of foreground objects and pixels that do not lie in the region of interest are marked in the ground truth segmentations; these pixels are ignored in the cost function.
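The masked BCE can be sketched as follows, using numpy as a stand-in for the actual training framework:

```python
import numpy as np

def masked_bce(pred, target, valid):
    """Binary cross-entropy averaged only over valid pixels; boundary and
    out-of-ROI pixels (valid == False) do not contribute to the loss."""
    eps = 1e-7                              # numerical safety for log()
    p = np.clip(pred[valid], eps, 1 - eps)
    t = target[valid]
    return float(-(t * np.log(p) + (1 - t) * np.log(1 - p)).mean())
```

Excluding the marked pixels from the mean keeps ambiguous boundary labels from penalizing the network.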
Spatial-median filtering, a commonly used post-processing method in background subtraction, returns the median over a neighborhood of given size (the kernel size) for each pixel in an image. As a consequence, the operation removes outliers in the segmentation map and also performs blob smoothing (see Figure 14). After applying the spatial-median filter on the network output, we globally threshold the values in order to map each pixel to . The threshold function is given by
where R is the threshold level.
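The post-processing step, median filtering followed by global thresholding, can be sketched as follows; the kernel size of 3 and threshold R = 0.5 are illustrative choices here:

```python
import numpy as np

def postprocess(prob_map, kernel=3, threshold=0.5):
    """Spatial median filter (edge-padded, pure numpy) followed by global
    thresholding of the network's per-pixel foreground probabilities."""
    r = kernel // 2
    padded = np.pad(prob_map, r, mode='edge')
    h, w = prob_map.shape
    windows = np.stack([padded[dy:dy+h, dx:dx+w]
                        for dy in range(kernel) for dx in range(kernel)], axis=0)
    filtered = np.median(windows, axis=0)
    return (filtered >= threshold).astype(np.uint8)
```

A single-pixel false positive is wiped out by the median, while a compact blob of high probabilities survives and is binarized.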
In order to evaluate our approach, we conducted several experiments on various data-sets. At first, we introduce the utilized data-sets for performance testing. Afterwards, the evaluation metrics are presented, which measure the quality of the segmentation outputs. The results on the evaluation data are subsequently reported. Furthermore, we analyze the network behavior during training. Additionally, we visualize the convolutional filters and generated feature maps at the end.
We employ multiple data-sets to perform our tests: the CDnet 2014 wang2014cdnet , the Wallflower toyama1999wallflower and the PETS 2009 data-set ferrymanpets2009 . The CDnet 2014 and the Wallflower data-set were specifically designed for the background subtraction task. They contain video sequences from different categories, corresponding to the main challenges in background subtraction, and hand-segmented ground truth images are provided. The PETS 2009 data-set ferrymanpets2009 was designed for other purposes, such as the evaluation of tracking algorithms, and therefore no ground truth segmentations are available for its video sequences. The employed data-sets are described in the following.
The CDnet 2014 wang2014cdnet that was used for training our CNNs is also used for performance evaluation. It extends its predecessor, the CDnet 2012 goyette2012changedetection data-set, with additional video sequences under new challenge categories.
For each video sequence in the data-set, corresponding ground truth segmentation images are available. For the newly added categories in CDnet 2014 wang2014cdnet (see Table 2), only half of the ground truth segmentations were provided to avoid parameter tuning of background subtraction algorithms on the benchmark data-set. One has to refer to the online evaluation (http://changedetection.net) in order to get the results over all ground truth segmentations.
[Table 2: Categories and video sequences of CDnet 2012 goyette2012changedetection and CDnet 2014 wang2014cdnet ]
Another data-set that we employ for evaluation purpose is the Wallflower data-set toyama1999wallflower . For each category in the data-set, there exists a single video sequence (see Table 3). Also, for each video sequence, a hand segmented ground truth segmentation image is provided. Hence, when evaluating background subtraction algorithms on the Wallflower data-set toyama1999wallflower , only a single ground truth segmentation is considered.
[Table 3: Wallflower video sequences/categories and number of frames]
The PETS 2009 data-set ferrymanpets2009 is a benchmark for tracking individual people within a crowd. It consists of multiple video sequences, recorded from static cameras, with different crowd activities. Since the data-set is not designed for background subtraction evaluation, there are no ground truth segmentation images for this purpose. Thus, only the qualitative segmentation results will be evaluated, without calculating any evaluation metric. A sample image from each category is shown in Fig. 15.
In order to measure the quality of a background subtraction algorithm, we evaluate the performance by comparing the output segmentations with the ground truth segmentations to get the following statistics:
True Positive (TP): Foreground pixels in the output segmentation that are also foreground pixels in the ground truth segmentation.
False Positive (FP): Foreground pixels in the output segmentation that are not foreground pixels in the ground truth segmentation.
True Negative (TN): Background pixels in the output segmentation that are also background pixels in the ground truth segmentation.
False Negative (FN): Background pixels in the output segmentation that are not background pixels in the ground truth segmentation.
Using these statistics, different evaluation metrics are calculated that are outlined in the following:
F Measure (FM): FM = (2 * Precision * Recall) / (Precision + Recall), with Recall = TP / (TP + FN) and Precision = TP / (TP + FP)
False Positive Rate (FPR): FPR = FP / (FP + TN)
False Negative Rate (FNR): FNR = FN / (TP + FN)
Percentage of Wrong Classifications (PWC): PWC = 100 * (FN + FP) / (TP + FN + FP + TN)
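Given the four pixel counts defined above, the standard CDnet definitions of these metrics can be computed as:

```python
def change_detection_metrics(tp, fp, tn, fn):
    """Compute the CDnet evaluation metrics from the pixel counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    fm = 2 * precision * recall / (precision + recall)   # F-Measure
    fpr = fp / (fp + tn)                                 # False Positive Rate
    fnr = fn / (tp + fn)                                 # False Negative Rate
    pwc = 100.0 * (fn + fp) / (tp + fn + fp + tn)        # Percentage of Wrong Classifications
    return {"FM": fm, "FPR": fpr, "FNR": fnr, "PWC": pwc}
```

Note that FNR = 1 - Recall and FPR = 1 - Specificity, which is why the FM, combining precision and recall, summarizes overall performance well.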
We are especially interested in the FM metric, since top-performing background subtraction algorithms typically exhibit higher FM values than weaker ones. This is due to the combination of multiple evaluation metrics in the calculation of the FM. Thus, the overall performance of a background subtraction algorithm is highly coupled with its FM performance.
In Section 3, we introduced our background subtraction system, consisting of a CNN, a background image retrieval module and a post-processing module. For the background image retrieval, we proposed a novel background image generation based on SuBSENSE st2015subsense and the flux tensor; for the post-processing, spatial-median filtering is considered. In order to find the best performing setup, we calculate the evaluation metrics presented in Section 4.2 over the CDnet 2014 wang2014cdnet for each setup of our background subtraction system. Also, we compare the category-wise FM and the overall FM for those data-sets with the FMs of other background subtraction algorithms. Afterwards, we select the best performing setup for further comparison. For the Wallflower data-set toyama1999wallflower , we calculate the FM for the best setup and again compare the values with those of other algorithms. Due to the missing ground truth images for the videos in the PETS 2009 data-set ferrymanpets2009 , we cannot derive numeric values for the FM and therefore only evaluate the segmentation outputs qualitatively. In order to compare our approach with other algorithms on multiple data-sets, we need to be able to generate the corresponding segmentations with different algorithms. For the CDnet 2014 wang2014cdnet , all algorithms listed in the online ranking (http://changedetection.net), their evaluation metrics and segmentation results are available. For the other data-sets, namely the PETS 2009 ferrymanpets2009 and the Wallflower data-set toyama1999wallflower , we explicitly generate the segmentation images for those video sequences. For this purpose, we employed the BGSLibrary bgslibrary , as for the background image generation.
As a consequence, we compare our method with those algorithms that have been evaluated online for the CDnet 2014 wang2014cdnet and that are implemented in the BGSLibrary bgslibrary .
As already mentioned in Section 3, we train the network with mini-batches of size 150 and a fixed learning rate over 10 epochs. The training data comprises 150 images per video sequence in the categories listed in Table 1. The validation data contains 20 frames per video sequence. We train the network with RMSprop hinton2012lecturermsprop using the BCE as the loss function. The training plot is illustrated in Figure 16. Both networks yield similar behavior and performance during training, mainly due to the identical network architecture and training setup.
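The two training ingredients named above can be sketched in a few lines of numpy: the BCE loss and a single RMSprop parameter update following Hinton's lecture formulation. The hyperparameter values shown here are common defaults, not the exact values used in the paper:

```python
import numpy as np

def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy between predictions p in (0, 1) and labels y."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def rmsprop_step(w, grad, cache, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSprop update: scale the gradient by a running root-mean-square
    of past squared gradients, so each weight gets its own step size."""
    cache = rho * cache + (1 - rho) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```

In an actual training loop the cache is carried across mini-batches, one `rmsprop_step` per batch of 150 image pairs.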
In order to measure the quality of our background subtraction algorithm, we compute the different evaluation metrics for our background subtraction system. The metrics are reported for the CDnet 2014 wang2014cdnet . We also compare our FMs with those of other background subtraction algorithms over the complete evaluation data. For this purpose, we compare our algorithm with the GMM stauffer1999mog , PBAS hofmann2012pbas and SuBSENSE st2015subsense algorithms. In addition, for both CDnet data-sets, we include further background subtraction algorithms in the FM comparison. We also compare the segmentation outputs among the algorithms; for this comparison, we select further video sequences from the PETS 2009 data-set ferrymanpets2009 .
In the following, the evaluation metrics on the CDnet 2014 are listed in Table 4 for each setup of our background subtraction system. The outputs are post-processed with a spatial-median filter, and the CNN outputs are binarized with a fixed threshold. As we can see in Table 4, the CNN yields very good results. In addition, the CNN yields the highest FM among the compared algorithms in 6 out of 11 categories (see Table 5 and Table 6).
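The post-processing step described above can be sketched as follows: threshold the CNN probability map, then apply a k x k spatial-median filter to suppress isolated misclassified pixels. The threshold and kernel size below are placeholders, since the exact values are not reproduced here:

```python
import numpy as np

def postprocess(prob_map, threshold=0.5, k=3):
    """Binarize a CNN probability map with a global threshold, then apply
    a k x k spatial-median filter (placeholder parameter values)."""
    mask = (prob_map >= threshold).astype(np.uint8)
    pad = k // 2
    padded = np.pad(mask, pad, mode='edge')  # replicate borders
    out = np.empty_like(mask)
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            # Median over the k x k neighborhood removes salt-and-pepper
            # outliers while preserving large foreground regions.
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out
```

In practice one would use a vectorized implementation such as `scipy.ndimage.median_filter`; the explicit loops here only make the operation transparent.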
[Table 4, excerpt] Intermittent Object Motion: 0.5735 | 0.9949 | 0.0051 | 0.4265 | 4.1292 | 0.6098 | 0.8251
For the Wallflower data-set toyama1999wallflower , we compare the FMs among the considered algorithms. For each video in the data-set, only a single ground truth image is given; therefore, the FM is calculated only for this ground truth image.
The different FMs are reported in Table 7. Please note that we do not consider the "MovedObject" video for comparison since the corresponding ground truth image contains no foreground pixels and therefore the FM cannot be calculated.
From the results we can see that the CNN yields the best overall FM among the considered background subtraction algorithms.
[Table 7, header] FM per video: CNN | SuBSENSE st2015subsense | PBAS hofmann2012pbas | GMM stauffer1999mog
The output segmentations are already precise and not very prone to outliers (e.g. from dynamic background regions); the main remaining problem is camouflage regions within foreground objects.
In the categories of CDnet 2014 where mainly the feature extraction and segmentation are challenging, the CNN gives the best results among the compared algorithms, leading in 6 out of 11 categories. For categories such as "PTZ (Pan Tilt Zoom)" or "low framerate", the CNN yields poor results since the background images provided by the proposed algorithm are insufficient for performing good background subtraction. Our method therefore produces poor segmentations for these categories and, as a consequence, the average FM drops significantly.
However, for the Wallflower data-set toyama1999wallflower , where in most cases the background model is not challenged, our algorithm outperforms all other algorithms in terms of segmentation quality and overall FM.
To sum up, our system outperforms the existing algorithms when the challenge does not lie in background modeling and maintenance; on the other hand, once there are large changes in the background scenery, the background images become corrupted and our method performs poorly.
For further understanding and intuition, we visualize the convolutional filters and the feature maps when an image pair is fed into the network. To replace feature engineering, filters are employed which are learned during training. Since the learning is performed with ground truth data, highly application-specific features are extracted. Our network contains 3 convolutional layers which perform the feature extraction; in particular, the convolutional filters, the so-called kernels, are responsible for that task. Some of them are illustrated in the following. All convolutional filters in our network share the same spatial size. The filters that are directly applied to the RGB image pair of input and background image are illustrated in Figure 19.
Due to the large number of filters, we only show the filter-sets that output the first feature map of the respective convolutional layer. The filter-sets from the remaining convolutional layers are shown in Figure 20. By applying the filters to the input images or feature maps, a convolutional layer scans its input for the patterns defined by the filters and generates new feature maps which capture task-specific characteristics.
By applying the learned filters to the network input and passing the intermediate values through the activation and subsampling functions, feature maps are generated. These in turn serve as the input of the subsequent convolutional layer or classifier. In order to analyze the generated feature maps, we first need to feed an input into the network. An example input pair consisting of background and input frame is shown along with the CNN output in Figure 18. The output feature maps from all convolutional layers in our network are illustrated in Figures 19 and 20. Finally, the vectorized feature maps of the last convolutional layer in our CNN are fed into an MLP that performs the pixel-wise classification.
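The core operation behind each feature map can be made concrete with a toy single-channel example: a "valid" cross-correlation of an image with one learned kernel, followed by a ReLU activation. The ReLU and the single-channel setting are simplifying assumptions for illustration, not a description of the exact network:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation of a single-channel image with one
    kernel, followed by a ReLU, producing one feature map.

    Toy illustration: real convolutional layers sum over all input
    channels and hold one kernel set per output feature map.
    """
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Response of the kernel at position (i, j)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)  # ReLU activation
```

Regions of the input that match a kernel's pattern produce large positive responses in the feature map, while non-matching regions are suppressed by the activation, which is what the visualized feature maps show.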
We have proposed a novel approach based on deep learning for the background subtraction task. The proposed approach includes three processing steps, namely background model generation, a CNN for feature learning, and post-processing. The CNN yielded the best performance among the compared background subtraction algorithms. In addition, we analyzed our network during training and visualized the convolutional filters and generated feature maps. Since our method works only with RGB data, and given the potential of deep learning approaches and the available data, one could also model the background itself with deep learning techniques, for example with an RNN. Furthermore, as we currently use a global threshold for our network outputs, one could use adaptive pixel-wise thresholds, as in the PBAS algorithm, employing a kind of "background dynamics" measure for the feedback loop. With this adaptation, one could increase the sensitivity in static background areas and decrease it for areas with dynamic background. Finally, when combining our method with an existing background subtraction algorithm, in our case SuBSENSE, one could merge the information of both output segmentations to obtain a refined output and employ it to improve the updates of the background model.
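The adaptive-threshold idea suggested above can be sketched as a PBAS-style feedback rule: raise the per-pixel threshold where a "background dynamics" measure is high (reducing sensitivity in dynamic areas) and lower it where the scene is static. All parameter names and values here are illustrative assumptions, not part of the proposed system:

```python
import numpy as np

def update_thresholds(thresholds, dynamics, t_min=0.2, t_max=0.9, rate=0.05):
    """PBAS-style feedback sketch for per-pixel decision thresholds.

    thresholds: current per-pixel thresholds in [t_min, t_max]
    dynamics:   per-pixel 'background dynamics' measure in [0, 1]
    All parameters are hypothetical placeholders.
    """
    # Move thresholds up where the background is dynamic (dynamics > 0.5)
    # and down where it is static (dynamics < 0.5).
    thresholds = thresholds + rate * (dynamics - 0.5)
    return np.clip(thresholds, t_min, t_max)
```

Called once per frame with an updated dynamics estimate, such a rule would replace the single global threshold with a slowly adapting per-pixel threshold map.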