Efficient Background Modeling Based on Sparse Representation and Outlier Iterative Removal

by   Linhao Li, et al.
Tianjin University

Background modeling is a critical component for various vision-based applications. Most traditional methods tend to be inefficient when solving large-scale problems. In this paper, we introduce sparse representation into the task of large scale stable background modeling, and reduce the video size by exploring its 'discriminative' frames. A cyclic iteration process is then proposed to extract the background from the discriminative frame set. The two parts combine to form our Sparse Outlier Iterative Removal (SOIR) algorithm. The algorithm operates in tensor space to obey the natural data structure of videos. Experimental results show that a few discriminative frames determine the performance of the background extraction. Further, SOIR can achieve high accuracy and high speed simultaneously when dealing with real video sequences. Thus, SOIR has an advantage in solving large-scale tasks.




1 Introduction

Background modeling of a video is a key part of many vision-based applications, such as real-time tracking CW1999 ; HCQ2011 ; KLM2012 , information retrieval and video surveillance T2011 ; AOM2006 . In a video, which consists of a series of frames, some parts of the scene stay almost constant, although they are polluted by noise SHL2012 . A model for extracting this invariant part is important because it helps us process the video, especially in public scenes DH2013 . In some cases, background modeling is an essential step in foreground detection MO2012 ; VN2008 . Once the background is extracted, we can detect or even track the foreground simply by comparing each incoming frame with the learned background model AOM2006 .

In the traditional background modeling problem, several factors make the background non-stationary, such as fluttering flags, waving leaves and rippling water YCCY2007 . Besides, other issues SC2004 , such as signal noise, sudden lighting changes and shadows STS2011 ; AT2008 , prevent us from distinguishing the background from the foreground. In addition, with the development of technology and the improvement of equipment, a new problem appears in many practical applications: the data of a practical background modeling problem keeps growing. Thus time consumption and memory requirements become key issues for an effective algorithm.

A large number of background modeling methods have been reported in the literature over the past few decades. Most researchers regard the pixel series as the features and set up pixel-wise models. One idea is to model each pixel series by a Gaussian distribution. Two pioneering works are the single Gaussian model in 1997 CATA1997 and the Mixture of Gaussians (MOG) in 1999 CW1999 . Based on these two works, several improved algorithms also achieved good performance in the following few years PR2002 ; Z2004 ; D2005 . Besides Gaussian models, clustering methods are also used to model the background, e.g., the codebook KTDL2004 ; KTDL2005 ; JYCC2011 ; AM2011 and time-series clustering AN2012 . Furthermore, a non-parametric method was proposed in 2000 ADL2000 and improved in 2012 EMA2012 , showing competitive performance. Recently, a new method (ViBe) was proposed in 2012 and improved afterwards, and it performs better than mainstream pixel-wise techniques MO2012 ; OM2011 . These methods build a model for each pixel and initialize all the models in a training process. Although higher accuracy can be obtained with sufficient training data, the speed is restricted by the size of the data, so there is a trade-off between precision and speed. Besides, obtaining ’good enough’ training data is itself a challenging task.

Another category deals with the background modeling problem at the region level. Some works pay attention to local regions, and different corresponding features have been proposed STS2011 ; LWIQ2004 ; MM2006 ; LWXG2010 ; SGV2010 . Furthermore, global region based algorithms have achieved better performance than the others. In 2000, Oliver et al. NBA2000 first modeled the background by Principal Component Analysis (PCA). This method models the background by projecting the high-dimensional data onto a lower-dimensional subspace. Recently, robust PCA (RPCA) in 2010 ZXJEY2010 and Principal Component Pursuit (PCP) in 2011 EXYJ2011 have shown their superiority over the original PCA, and several heuristic methods followed in the next few years GA2011 ; CTE2012 ; XCW2013 . These models do not need a training process and can exploit all the information contained in arbitrarily given data. However, solving them requires the Singular Value Decomposition (SVD), which is well known to be very time consuming. The speed and the required memory are therefore sensitive to the scale of the data, which limits these models in large-scale problems.

Figure 1: The framework of Sparse Outliers Iterative Removal Algorithm (SO-IR).

In this paper, we propose a Sparse Outliers Iterative Removal (SO-IR) algorithm. SO-IR meets the demand of solving large-scale problems and achieves high accuracy and high speed simultaneously. To handle large-scale problems, we introduce sparse representation (or sparse coding) into background modeling: we look for the ’discriminative’ frames, which are far fewer than the original frames yet ’discriminative’ enough to model the background. Besides, we propose a cyclic iteration, composed of a tensor-wise PCA model and a pixel-wise outlier removal strategy, to extract the background from the ’discriminative’ frames. The two parts form a whole, which we call the Sparse Outliers Iterative Removal (SO-IR) algorithm. Its framework is shown in Fig.1. In addition, we detect the foreground objects with a Markov Random Field (MRF). Experiments show that SO-IR outperforms most mainstream algorithms in both speed and precision, especially on large-scale problems. The main contributions can be summarized as follows:

  • We utilize sparse representation to refine the frames of the video. In background modeling, too many redundant frames usually deteriorate the performance. By working on the selected frame set, which is composed of the discriminative frames, SO-IR can extract the background quickly and accurately. Thanks to this, our model can solve large-scale problems efficiently, which is significant in practical background modeling problems.

  • The tensor-wise model in the cyclic iteration is in fact a PCA model. Different from previous works, the simplified background matrix of a static background problem is explicitly rank-1, instead of just being low-rank. To enforce this constraint, we introduce a new space in which the background actually lies. We solve the tensor model by modifying the traditional alternating direction multiplier technique.

  • We give a cyclic iteration that is composed of a tensor-wise model and a pixel-wise outlier removal strategy. In general, a tensor-wise process considers the overall information and is usually much faster, while a pixel-wise process pays more attention to individual information and is usually much more accurate. Our cyclic iteration makes full use of both advantages, and is thus fast and accurate.

The remainder of this paper is organized as follows. Section 2 introduces some related work; Section 3 gives the formulation of the SO-IR algorithm; Section 4 presents the policy for detecting the foreground region; Section 5 shows the experiments; finally, Section 6 concludes the paper.

2 Related work

In our algorithm, the sparse representation model and the principal component analysis model play the key roles: the sparse representation process and the cyclic outlier removal strategy are built on them. We review the related work on both here. Besides, our method follows the natural tensor structure of the video and its frames, so we also give some notation on tensors.

2.1 Sparse representation

In recent years, sparse representation (or sparse coding) has become a focus of research JAASY2009 ; WDLXD2013 , and it is a powerful tool for revealing the structure of data. With the items (or signal-atoms) from an over-complete dictionary, we can represent the dictionary and new inputs by linear combinations of the items ZMLYD2013 . Researchers follow this idea and explore the sparse structure of practical problems, such as abnormal event detection YJJ2011 and human action recognition TR2012 , which are in fact problems of background extraction and foreground detection. In CXW2011 , the authors assume that the backgrounds are the atoms of the dictionary, and consider the foreground and the noise as pollution. In 2012, Ehsan et al. EGR2012 regarded each single frame as an atom and sought the ’special’ frames in a video with the help of sparse representation.

We follow the work of Ehsan et al., because its assumption is more reasonable when the foreground objects are complicated. We set up a sparse representation model for the video and explore the ’discriminative’ frames, from which we can extract the background accurately.

2.2 Principal component pursuit

Principal Component Analysis (PCA) is one of the most popular ways to find a low-dimensional subspace. PCA solves the following optimization problem EXYJ2011 :

$$\min_{L} \; \|M - L\| \quad \text{s.t.} \quad \operatorname{rank}(L) \le k, \qquad (1)$$

where $M$ denotes the given data matrix, $L$ is its low-rank approximation with target rank $k$, and $\|\cdot\|$ is the matrix spectral norm. A number of natural approaches to robustifying PCA have been explored and proposed in the literature over several decades. Unfortunately, none of them has given fully satisfying results.
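As a minimal sketch of the classical PCA problem above, and relying on the standard Eckart-Young result, the best rank-k approximation in the spectral norm can be obtained by truncating the SVD. The names `best_rank_k`, `M` and `k` are illustrative and are not taken from the paper.

```python
import numpy as np

def best_rank_k(M, k):
    """Best rank-k approximation of M (Eckart-Young), via a truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Keep the k largest singular triplets and rebuild the matrix.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))  # exactly rank 2
L2 = best_rank_k(M, 2)   # reproduces M up to rounding
L1 = best_rank_k(M, 1)   # spectral error equals the 2nd singular value of M
```

The SVD call is also the step that dominates the cost on large data, which is the bottleneck the introduction points out.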

Recently, Candès et al. EXYJ2011 proved that one can exactly recover the low-rank matrix as well as the sparse error matrix under mild conditions. The model, which is known as Principal Component Pursuit (PCP), is as follows:

$$\min_{L, S} \; \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad M = L + S, \qquad (2)$$

where $\lambda$ is an appropriate weighting parameter, and $\|\cdot\|_*$ and $\|\cdot\|_1$ denote the nuclear norm (sum of singular values) and the 1-norm (sum of the absolute values of matrix elements), respectively.

Model (1) and Model (2) are modified to solve the background modeling problem CTE2012 ; XCW2013 . They model the background by a low-rank matrix and model the foreground by a sparse matrix. Different from their ideas, we consider the background as an explicitly rank-1 matrix.
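To make the low-rank plus sparse decomposition concrete, here is a hedged sketch of PCP solved with a basic augmented Lagrangian (ADMM) loop. It is not the SO-IR solver; the function names and the default choices of `lam` and `mu` are assumptions commonly used with PCP, and every iteration needs a full SVD.

```python
import numpy as np

def soft(X, tau):
    """Elementwise soft-thresholding (shrinkage) operator."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def pcp(M, lam=None, mu=None, n_iter=500, tol=1e-7):
    """Sketch of PCP: min ||L||_* + lam * ||S||_1  s.t.  M = L + S."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam    # assumed default
    mu = m * n / (4.0 * np.abs(M).sum()) if mu is None else mu  # assumed default
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(n_iter):
        # L-step: singular value thresholding.
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * soft(s, 1.0 / mu)) @ Vt
        # S-step: elementwise shrinkage.
        S = soft(M - L + Y / mu, lam / mu)
        # Dual update on the constraint residual.
        R = M - L - S
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(M):
            break
    return L, S

rng = np.random.default_rng(0)
L0 = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))  # "background"
S0 = np.where(rng.random((30, 30)) < 0.05, 10.0, 0.0)             # sparse "foreground"
M = L0 + S0
L_hat, S_hat = pcp(M)
```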

2.3 Tensors theory

A tensor is a multidimensional array. More formally, an $N$-way or $N$-order tensor is an element of the tensor product of $N$ vector spaces, each of which has its own coordinate system BT2006 ; TB2009 . Intuitively, a vector is a 1-order tensor while a matrix is a 2-order tensor. In this paper, we denote a vector by a lowercase letter, e.g., $x$, and a matrix by an uppercase letter, e.g., $X$. A higher-order tensor is denoted by a boldface letter, e.g., $\mathbf{X}$. The space of all tensors is denoted by a swash letter, and the space of all $N$-order tensors is denoted accordingly.

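There are several rank definitions for an $N$-order tensor $\mathbf{X}$. Here, we use the n-rank (or multilinear rank), which is based on the unfoldings of a tensor TB2009 . The n-rank of $\mathbf{X}$ is the tuple of the ranks of its mode-$n$ unfoldings $X_{(n)}$:

$$\operatorname{rank}_n(\mathbf{X}) = \big( \operatorname{rank}(X_{(1)}), \operatorname{rank}(X_{(2)}), \ldots, \operatorname{rank}(X_{(N)}) \big). \qquad (3)$$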
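The n-rank can be checked numerically with a short sketch. The helper names `unfold` and `n_rank` are illustrative; the column ordering of this unfolding may differ from the convention in the cited references, which does not affect the rank.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: mode `mode` indexes the rows, all other modes the columns."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def n_rank(T):
    """Tuple of the matrix ranks of all mode-n unfoldings of T."""
    return tuple(int(np.linalg.matrix_rank(unfold(T, n))) for n in range(T.ndim))

a, b, c = np.arange(1.0, 4.0), np.arange(1.0, 5.0), np.arange(1.0, 6.0)
T = np.einsum('i,j,k->ijk', a, b, c)   # outer product of three vectors
print(n_rank(T))                       # (1, 1, 1)
```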


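Tensors can be multiplied together BT2006 . Let $\mathbf{A}$ be of size $I_1 \times \cdots \times I_M \times J_1 \times \cdots \times J_N$ and $\mathbf{B}$ be of size $I_1 \times \cdots \times I_M \times K_1 \times \cdots \times K_P$. We can multiply the two tensors along the first $M$ modes, and the result is a tensor of size $J_1 \times \cdots \times J_N \times K_1 \times \cdots \times K_P$, given by

$$\langle \mathbf{A}, \mathbf{B} \rangle_{\{1,\ldots,M;\,1,\ldots,M\}}(j_1, \ldots, j_N, k_1, \ldots, k_P) = \sum_{i_1=1}^{I_1} \cdots \sum_{i_M=1}^{I_M} \mathbf{A}(i_1, \ldots, i_M, j_1, \ldots, j_N)\, \mathbf{B}(i_1, \ldots, i_M, k_1, \ldots, k_P), \qquad (4)$$

where $1 \le i_m \le I_m$ ($m = 1, \ldots, M$); $1 \le j_n \le J_n$ ($n = 1, \ldots, N$) and $1 \le k_p \le K_p$ ($p = 1, \ldots, P$).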
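A contraction of two tensors over their leading modes can be sketched with `numpy.tensordot`. The function name `contract_first_modes` and the example sizes are hypothetical.

```python
import numpy as np

def contract_first_modes(A, B, m):
    """Sum A and B over their first m modes, whose sizes must agree."""
    axes = list(range(m))
    return np.tensordot(A, B, axes=(axes, axes))

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))       # shared modes (2, 3), free mode (4,)
B = rng.standard_normal((2, 3, 5, 6))    # shared modes (2, 3), free modes (5, 6)
C = contract_first_modes(A, B, 2)        # result has shape (4, 5, 6)
```

The result keeps the free modes of the first tensor followed by the free modes of the second.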

3 Sparse outliers iterative removal algorithm

In this section, we focus on modeling the background of a video, and will give the details of foreground detection in Section 4.

We use to denote a video. We know, a video contains a series of colorful frames, and assume that there are frames. Each frame is a 3-order tensor by nature, and the th frame of a video is denoted by . Then .

We analyse the components of a video first. In real frame series, the background is covered by the foreground objects. Denote the foreground region by , and the outside region by . Let be the orthogonal projector onto the span of tensors vanishing outside of . Then the -th component of is equal to if and zero otherwise. Thus the component of the video can be expressed as:


where and means the background and the foreground of the -th frame in the selected set , respectively. Actually, , because is just the foreground region. Besides the two mentioned parts, the noise is also an essential component in the video. i.e.


where is noise. Equation (6) gives the actual components of a video. The equation will be a strict constraint in our model.

3.1 ’Discriminative’ frames exploration by sparse representation

In most background modeling problems, the frame series are much too redundant for the task of background modeling. A few particular frames usually carry enough information of the background, while too many repeated frames in the video would just hinder the work. In a PCA model, the data is projected onto a lower dimensional space. The low dimensional space stands for the static information among the frames and other information is ignored. If the foreground objects are unchanged or changing slowly, they are more likely to be regarded as stable. In this section, we refine the frame sequences and get a new set , which is composed of the selected ’discriminative’ frames.

A video contains a large number of frames. Some of them can be represented by a linear combination of the others, and some are approximately repeated. In other words, the frame set can be sparsely represented by itself, so we take the frame set itself as the original dictionary. For simplicity, we transform each frame in the tensor into a gray one; the information retained in the gray frames is enough for our purpose. The sparse representation of the resulting simplified (gray) frame set is as follows:


where is the coefficient matrix. Model (7) is a standard tensor sparse representation problem, and is easy to solve.

In Model (7), the simplified set represents itself when multiplied by the coefficient matrix, and the same holds for the original frame set. By looking at which rows of the coefficient matrix are nonzero, we can deduce the role of each frame in representing the whole set. All the useful frames are picked out of the original set to form a low-level refined frame set.
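The self-representation idea can be sketched as follows. Since the exact form of Model (7) is not reproduced here, this sketch assumes a row-sparse least-squares variant, $\min_C \tfrac12\|Y - YC\|_F^2 + \lambda \sum_i \|C_{i,:}\|_2$, solved by proximal gradient descent; the function name `select_frames`, the parameter `lam` and the solver are assumptions, not the authors' implementation.

```python
import numpy as np

def select_frames(Y, lam, n_iter=300):
    """Y: (d, N) matrix whose columns are vectorized gray frames.
    Returns the indices of frames with a nonzero row in C, and C itself."""
    N = Y.shape[1]
    G = Y.T @ Y
    step = 1.0 / np.linalg.norm(G, 2)      # 1 / Lipschitz constant of the gradient
    C = np.zeros((N, N))
    for _ in range(n_iter):
        Z = C - step * (G @ C - G)         # gradient step on 0.5 * ||Y - Y C||_F^2
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        shrink = np.maximum(1.0 - step * lam / np.maximum(norms, 1e-12), 0.0)
        C = shrink * Z                     # row-wise group soft-thresholding
    keep = np.flatnonzero(np.linalg.norm(C, axis=1) > 1e-8)
    return keep, C

rng = np.random.default_rng(0)
Y = rng.standard_normal((20, 8))           # 8 toy "frames" with 20 pixels each
keep, C = select_frames(Y, lam=0.5)
```

Rows of `C` that are driven to zero correspond to frames that are never used to represent the others, so a larger `lam` keeps fewer frames.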

The results of the above step are determined by the properties and the size of the frame series. However, as will be seen in our experiments, selecting a few discriminative frames is enough to extract the background successfully. This is because, in most cases, even a small change in the content of the video breaks the linear relationship between different frames. To further reduce the size of the refined set, we could dynamically adjust the parameter in Model (7) for each specific problem, but this is complicated. So we fix the parameter for convenience and obtain the selected set by the following operation.

We consider the space of all the frames, in which each frame is an element, and choose the Euclidean distance as the metric, which makes it a Euclidean space. We judge whether a frame is ’discriminative’ enough by its distance to all the other frames:


Thus we can select the top few ’discriminative’ frames with the help of (8) and form the selected set. We claim that the result of the background extraction does not depend on the whole frame series; it is the few ’discriminative’ frames that matter.
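The distance-based ranking can be sketched as below. The sketch assumes that the score of a frame is the sum of its Euclidean distances to all other frames, which is one plausible reading of (8); the function name `top_discriminative` and the toy data are illustrative.

```python
import numpy as np

def top_discriminative(frames, k):
    """frames: (N, H, W) gray frames. Returns the indices of the k frames with
    the largest summed Euclidean distance to the others, and all scores."""
    X = frames.reshape(len(frames), -1).astype(float)
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    scores = np.sqrt(d2).sum(axis=1)
    order = np.argsort(-scores)[:k]
    return order, scores

# Four identical frames plus one frame that differs everywhere.
frames = np.zeros((5, 2, 2))
frames[4] = 1.0
order, scores = top_discriminative(frames, k=1)
print(order)   # [4]
```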

Eventually, we get frames: . They are the elements selected from the low-level refined frame set , thus from the original set . The set is much smaller than the original one. In our experiments, we find that 20 to 30 frames are already enough to model the background in most videos, which could be composed of dozens of frames or even hundreds of frames.

Fig.2 illustrates the process of our algorithm. We are not the first to introduce sparse representation and dictionary learning into background modeling. However, different from the previous works CXW2011 ; YJJ2011 , we assume that the discriminative frames are the atoms of the dictionary, which is a more reasonable assumption in practical problems. Besides, we use this process to refine the video, instead of modeling the background directly. Replacing the original data by the dictionary is an effective and reasonable dimension reduction strategy for large-scale problems: it greatly improves the efficiency and reduces the memory needed. In the next section, we continue our work on the selected ’discriminative’ set instead of the original frame set.

Figure 2: The process of our SO-IR algorithm: First, we get the original frame set (50 frames). Second, we select 10 ’discriminative’ frames to form . At last, we solve the background .

3.2 Background extraction by cyclic iteration process

Our task in this section is to extract the background from the selected frame set. Thanks to the process in Section 3.1, the foreground objects in different frames are distinct. Our task can thus be summarized by Equation (9): the value of a pixel in the background is a linearly weighted sum of all the corresponding values in the selected frames.


where is the weight series for calculating the value of pixel (). And here reflects the error, which is sufficiently close to zero. As to different pixels, we will have to seek different corresponding weight series.

Figure 3: The values of the pixel () in different frames.

Fig.3 gives an example: the values of a pixel in different frames are represented by colored solid points. In some frames, the background is polluted by noise or covered by a foreground object, so the values are far from the ground truth. We call them outliers. Obviously, their weights should be extremely small. On the other hand, the inliers lie around the truth in most frames, and their weights should be updated to large ones.

Here we propose a cyclic iteration process that combines pixel-wise and tensor-wise ideas. We use a tensor model to calculate the purified-mean of all the frames. Then, based on the value of each pixel in the purified-mean frame, we update the values of all the frames by a pixel-wise outlier removal strategy.

3.2.1 Pixel-wise strategy

As is shown in Fig.3, the values of a pixel in different frames () are around the truth. We calculate their purified-mean, which is close to the mean of them. We use the purified-mean to approximate the true value in Equation (9), and its calculating process will be introduced in next section.

The purified-mean value may still be a little away from the ground truth. But it is much better than the worst outlier. So we replace the worst one with the purified-mean value.

The replacing process is shown in Fig.4, which depicts the outlier removal process for one pixel in one iteration. First, we take the frames obtained after the previous iteration and pick out the values of this pixel in these frames, i.e. the colored solid points in Fig.4. Second, we calculate the purified-mean of the frames, which gives the value of this pixel in the purified-mean frame, i.e. the black solid point. Third, we find the value that is farthest from the purified-mean value and replace it with the purified-mean value. Finally, we obtain the new values; in fact, only one of them differs from the original values.

The above process is only for one pixel. In the -th iteration, we repeat this process for all the pixels. Thus, we get new frames after the -th iteration, and the purified-mean of them must be closer to the ground truth than last iteration. The iteration will continue until the value of each pixel in the purified-mean frame converges to a constant.

Figure 4: The outlier removal process of the pixel () in the -th iteration.

The outlier removal process is in fact the weight updating process of Formula (9). Once we use the purified-mean to replace the worst outlier, the corresponding weight becomes smaller than in the previous iteration. But there is no need to calculate the actual weights, since what we care about is their accumulated effect.
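One pass of the pixel-wise replacement can be sketched with array indexing. In this sketch the plain per-pixel mean stands in for the purified-mean frame, which is an assumption made only to keep the example short; `remove_worst_outlier` is a hypothetical name.

```python
import numpy as np

def remove_worst_outlier(frames, center):
    """frames: (n, H, W) float array; center: (H, W) purified-mean frame.
    For every pixel, overwrite the value farthest from `center` with `center`."""
    out = frames.copy()
    worst = np.abs(frames - center).argmax(axis=0)   # frame index per pixel
    rows, cols = np.indices(center.shape)
    out[worst, rows, cols] = center
    return out

# One pixel observed in five frames; the last value is covered by foreground.
frames = np.array([10.0, 10.0, 10.0, 10.0, 200.0]).reshape(5, 1, 1)
center = frames.mean(axis=0)              # stand-in for the purified-mean
updated = remove_worst_outlier(frames, center)
print(updated.ravel())                    # [10. 10. 10. 10. 48.]
```

Only one value per pixel changes in each pass, which matches the description of the replacing process.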

3.2.2 Tensor-wise PCA model

In this section, we will set up a tensor model to calculate the purified-mean of the selected frames. It is the mean of the frames at first, but moved a little away in the process of denoising. We solve the model by the modified Alternating Direction Multipliers (ADM) method, and the solution of it is limited to lie in a space.

The background is unchanged in our problem. In different frames, it should be all the same, i.e. , or the following:


where is a one-order tensor.

Constraint (10) in fact insists that background should be ’rank-1’. First, we simplify the tensor by combining the tensor’s mode-1 and mode-2 into one single mode, just like the vectorize process in the previous works CXW2011 ; CTE2012 ; EGR2012 , i.e. . And we define . Then, we give Lemma 1.

Lemma 1.

In a static background problem, the -rank of a simplified background tensor is: -, if and only if the slices (of the tensor) in different color channels are nonlinear correlated.

The ranks of and are all 3. This is caused by the nonlinear correlation of different color channels. In fact, is the most important conclusion. If the frames we deal with are gray, then it is a rank-1 constraint.

To solve Constraint (10), we consider a subspace of , which is denoted by . In this space there are all 4-order tensors, of which all the frontal slices are the same (i.e. ). Obviously, is convex, it is easy to find the solution to the problem in this space.

Lemma 2.

Given a tensor , the solution to the optimization problem


is . Actually, it is the average of the N frames.
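The averaging result can be verified in one line, assuming the objective of (11) is the squared Frobenius distance between the given tensor and a tensor whose $N$ frontal slices are all equal. The symbols $D_i$ (the $i$-th frame) and $B$ (the common slice) are notation assumed for this sketch only.

```latex
% Assumed notation: D_i is the i-th frame (i = 1, ..., N), B the common frontal slice.
\min_{B} \; \sum_{i=1}^{N} \| B - D_i \|_F^2
\;\Longrightarrow\;
\frac{\partial}{\partial B} \sum_{i=1}^{N} \| B - D_i \|_F^2
  = 2 \sum_{i=1}^{N} ( B - D_i ) = 0
\;\Longrightarrow\;
B^{\star} = \frac{1}{N} \sum_{i=1}^{N} D_i .
```

Because the objective is a strictly convex quadratic in $B$, this stationary point is the unique minimizer.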

All the former works in this section are for the whole frame series. They are also true for the frames in the selected set . As is illustrated before, what we want is the background part, or the static content. So we minimize the changing part to group more information into the background. Besides, we take strict Constraint (6) into account and give our model:


where the norm in the objective equals the sum of all the nonzero absolute values in the set. In the foreground region, we minimize the number of elements that differ. Outside the foreground region, a pixel is a mixture of the background and the noise, so we just minimize the noise. Thanks to the properties of this norm, we merge the two regions into one single formula, which gives the objective function in Model (12).

Then we solve Model (12) by a modified ADM method. We first arrange the model. The model is to extract the background, or the unchanged part among the frames. Thus the changing content, either the foreground object or the noise, is nothing but pollution on the background. From this point, we denote all the non-background parts by . Then Model (12) is transformed into the following form:


This model is much simpler than Model (12). It is also a PCA model, and its solution lies in the space introduced above.

In Model (13), the constraint cannot be transformed into a single-variable linear equation. Moreover, this constraint is strict and cannot be relaxed, so we use it as a correction term. Now, we consider the model:


Obviously, the variables here are tensors, instead of matrices. We can still follow the idea of ADM, which uses a multiplier to form the augmented Lagrangian, i.e.


where the newly introduced tensor is the Lagrange multiplier. Then we get the iteration


where is in fact a step-length parameter. is a soft-threshold operator for tensor. And we get the iteration for by utilizing the following lemma:

Lemma 3.

Given a tensor , the solution to the optimization problem


is , where


Here the tensor plus (or minus) a single number means, all the elements of it plus (or minus) this number, respectively.
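A soft-threshold operator for tensors can be sketched elementwise: entries above the threshold are decreased by it, entries below its negative are increased by it, and the rest are set to zero. This is the standard shrinkage operator; the name `soft_threshold` and the symbol `tau` are assumptions of this sketch.

```python
import numpy as np

def soft_threshold(W, tau):
    """Elementwise shrinkage of an array of any order by the threshold tau >= 0."""
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

W = np.array([[-3.0, -0.5],
              [0.2, 2.0]])
shrunk = soft_threshold(W, 1.0)   # entries become -2, 0, 0 and 1
```

The same function works unchanged on 3-order or 4-order arrays because every operation is elementwise.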

As we have illustrated before, the tensor must lie in the space. Once we get a new in one iteration, we project it onto the space. That means, we use its vertical projection on the space to replace it and continue the operation. In other words, we meet the constraint , and use to replace . Then the iteration is transformed into the following form:


The process above follows the idea of ADM, but applies it to a tensor model by designing a soft-threshold operator for tensors and using some tensor theory.
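The projected iteration can be sketched under explicit assumptions, because the update formulas themselves are not reproduced here. The sketch assumes the model "minimize the 1-norm of the non-background part, subject to data = background + non-background, with the background identical in every frame", uses the frame average as the projection, and treats `mu` as a fixed step-length parameter. The names `purified_mean`, `D`, `E`, `Y` and `mu` are hypothetical.

```python
import numpy as np

def soft_threshold(W, tau):
    """Elementwise shrinkage operator."""
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

def purified_mean(D, mu=1.0, n_iter=100):
    """D: (n, H, W) stack of selected gray frames (n_iter >= 1).
    Returns an (H, W) background estimate shared by all frames."""
    E = np.zeros_like(D, dtype=float)   # non-background part
    Y = np.zeros_like(D, dtype=float)   # Lagrange multiplier
    for _ in range(n_iter):
        # Background step: averaging over frames projects onto "all frames equal".
        B = (D - E + Y / mu).mean(axis=0)
        # Non-background step: shrink the residual.
        E = soft_threshold(D - B + Y / mu, 1.0 / mu)
        # Multiplier step.
        Y = Y + mu * (D - B - E)
    return B

rng = np.random.default_rng(0)
D = rng.random((5, 3, 3))
B_first = purified_mean(D, n_iter=1)    # equals the plain mean of the frames
```

With a single iteration the estimate is exactly the frame average, consistent with the remark that the purified-mean starts as the mean and then moves away during denoising.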

3.3 Algorithm formulation and convergence analysis

Here, we give our Sparse Outliers Iterative Removal Algorithm (SO-IR). And its convergence condition is -.

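Algorithm 1: Sparse Outliers Iterative Removal (SO-IR). [Listing not recoverable; the surviving fragments show a while loop whose body consists of steps 3 and 4.]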

In Algorithm 1, the sparse representation of the frames is the key part, guaranteeing that the selected frames carry enough information. The cyclic iteration process is the main part: in it, we calculate the purified-mean of the selected frames and use it to update the frames by replacing the outliers pixel-wise.

Now we discuss the convergence of SO-IR. Here we prove that, for an arbitrary pixel (), the cyclic iteration process will return a solution. Then the conclusion also works for the tensor.

The selected set contains a fixed number of frames, so each pixel has that many values, as shown in Fig.3. All of these values lie between their minimum and their maximum. In each iteration, we record the minimum and the maximum of the current values. The purified-mean value must lie between them. Thus, if the interval between the minimum and the maximum shrinks to one point, the purified-mean series must converge to the same point, which is the solution of our algorithm. In other words, we have to prove that the length of this interval tends to zero as the number of iterations tends to infinity.

First, for the , which is close to 0, we have: . This can be inferred from the process of the PCA model. The purified-mean value is just around the mean of all the values, and it must be inside the interval if is small. Otherwise, the values in the -th iteration must be close to each other, or to say, they converge to a point. Second, we know that . This is because that, in each iteration, the worst outlier is replaced by the purified-mean value. It must belong to . Thus after iterations, the maximum and the minimum of all the values must be closer. Finally, ,,. When the number of iterations increases from 1 to infinity, the interval [] gets smaller and smaller. Then there must be a constant , we have .
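The shrinking-interval argument can be illustrated numerically for a single pixel. The plain mean is used as a stand-in for the purified-mean, which is an assumption of this sketch; `iterate_pixel` is a hypothetical helper.

```python
import numpy as np

def iterate_pixel(values, n_iter):
    """Repeatedly replace the value farthest from the mean by the mean.
    Returns the final values and the interval width after each iteration."""
    v = np.asarray(values, dtype=float).copy()
    widths = []
    for _ in range(n_iter):
        m = v.mean()                       # stand-in for the purified-mean
        v[np.abs(v - m).argmax()] = m      # replace the worst outlier
        widths.append(v.max() - v.min())
    return v, widths

final, widths = iterate_pixel([10.0, 10.0, 10.0, 10.0, 200.0], n_iter=30)
# widths starts 38.0, 7.6, 1.52, ... and never increases.
```

Because the replacement value always lies inside the current interval, the interval can only shrink. In this toy case the values converge to the majority value, which need not equal the ground truth in general.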

We must stress that the solution may not be the ground truth. As mentioned above, the process is influenced by the properties of the video. The results in Section 5 show that the solution is very close to the ground truth as long as the video is not too unfavorable.

4 Foreground region detection

As illustrated before, a frame in the video is composed of three parts: the background, the foreground objects and all kinds of noise. In section 3, we efficiently compute the background tensor. Now, our task is to detect the foreground region.

4.1 Background subtraction

Background subtraction is a common method to detect the foreground region. And it is the first step in foreground detection.

Constraint (6) gives the formulation of the frame series in our model. It also works for each single frame , i.e.


The result got in Section 3 is for the whole video. We denote the background of this frame by . Then we can get the result of subtraction, which is denoted by :


We find that the residual background exists only in the foreground region, and outside this region there is nothing but noise, i.e.


From Expression (22), we can see the essence of background subtraction. The result depends on the properties of the foreground and the background, as well as the relationship between them. This explains some so-called impossible cases: for example, a white coat hanging on a white wall is hard to detect. If the foreground has the same distribution as the background, it is almost impossible to detect the foreground region without resorting to video semantic analysis methods.
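The white-coat example can be reproduced with a small sketch of background subtraction. The fixed threshold `thresh` is an assumption used only for illustration; the paper labels the foreground with an MRF instead, and `subtract_background` is a hypothetical name.

```python
import numpy as np

def subtract_background(frame, background, thresh):
    """Returns the signed difference and a boolean mask of large differences."""
    diff = frame.astype(float) - background.astype(float)
    mask = np.abs(diff) > thresh
    return diff, mask

background = np.full((4, 4), 200.0)   # bright wall
frame = background.copy()
frame[0:2, 0:2] = 50.0                # dark object: clearly visible
frame[3, 3] = 200.0                   # "white coat": same value as the wall
diff, mask = subtract_background(frame, background, thresh=30.0)
print(int(mask.sum()))                # 4
```

Only the dark object is found; the pixel whose foreground value equals the background leaves no trace in the difference.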

4.2 Foreground detection

In this section, we explore the foreground region of an arbitrarily given image from the original frame set. To simplify the problem, we transform the color frame into a gray one. In most cases, the foreground objects are contiguous pieces, so we can model the region by a Markov Random Field SD1984 , following the idea of some previous works XCW2013 ; S2009 .

Figure 5: The background extracting results with the size of the selected set varying from 1 to 30; The standard background is extracted when the selected set is composed of 40 frames.

First, we set up a matrix to represent the foreground region :


A pixel is inside the foreground region if it is labeled 1 in this matrix; otherwise, it lies outside the foreground region. The energy of the foreground region can then be obtained from the Ising model S2009 :


where and are two positive parameters, that penalize and , respectively.

Obviously, if we just minimize the energy of foreground region , it will converge to an empty set, i.e. . In the foreground detecting process, we also tend to allocate the major information of the background subtraction into the foreground part. So an important component of the objective function is . Besides, the non-zero elements outside the foreground region should also be minimized. Then we get the following model:


The foreground detecting problem can be rearranged as follows:


Here the rearrangement uses a projection matrix, and the constant part is omitted because it is irrelevant to the optimization. Thus Problem (26) is in the standard form of a first-order MRF and can be solved exactly using graph cuts YOR2001 .
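To show why an Ising-type energy favors contiguous foreground regions, the sketch below evaluates such an energy for a binary mask. It assumes the common form "beta times the number of foreground pixels plus gamma times the number of label changes between 4-connected neighbors"; `beta`, `gamma` and `ising_energy` are names assumed here, and the minimization by graph cuts is not shown.

```python
import numpy as np

def ising_energy(S, beta, gamma):
    """Energy of a binary mask S: area term plus 4-neighbor boundary term."""
    S = np.asarray(S, dtype=float)
    area = S.sum()
    boundary = np.abs(np.diff(S, axis=0)).sum() + np.abs(np.diff(S, axis=1)).sum()
    return beta * area + gamma * boundary

blob = np.zeros((6, 6))
blob[2:4, 2:4] = 1                         # one compact 2x2 region
scattered = np.zeros((6, 6))
scattered[[1, 1, 4, 4], [1, 4, 1, 4]] = 1  # four isolated pixels

print(ising_energy(blob, 1.0, 1.0))        # 12.0
print(ising_energy(scattered, 1.0, 1.0))   # 20.0
```

Both masks cover four pixels, but the compact one has a shorter boundary and therefore lower energy.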

5 Experimental analysis

In this section, we evaluate the performance of our Sparse Outliers Iterative Removal algorithm (SO-IR). We explore the appropriate number of ’discriminative’ frames, test the performance of our algorithm and check its ability to solve large-scale problems. The experiments are carried out on real sequences from public datasets, such as the I2R dataset LWIQ2004 and the flowerwall dataset KJBB1999 . Other real video sequences from public resources are also included. All the experiments are conducted and timed in Matlab R2010a on a PC with an Intel(R) Core(TM) 3.20GHz CPU and 4GB of RAM.

5.1 Number of the discriminative frames

A major contribution of this paper is that, while purifying the frame series of a video, we use sparse representation to select the most discriminative frames. In this section, we explore the appropriate number of discriminative frames. We carry out our measurements on the I2R dataset. We first give details for the ”Bootstrap” sequence of the I2R dataset, a video whose scene is in front of the buffet in a restaurant. We also use some other video sequences in the dataset and give the corresponding results.

We use the first 300 frames of the video sequence as our original frame set, and measure the performance of the SO-IR algorithm as the number of frames in the selected frame set varies from one to thirty. We also need a standard background for comparison. We use the first forty frames as the selected set to extract the standard background, because this result is already very close to the globally optimal solution. The result is shown in Fig.5. Most of the extracted backgrounds are quite similar to the standard one, even when the number of selected frames is small. However, it is somewhat disappointing that the counter is not recovered exactly even in our standard background. This is a limitation of the video sequence itself. The two slightly blurred areas are in fact the spaces directly in front of the buffet, where people are standing and taking food in almost all of the 300 frames. As a result, we have no evidence that these two areas do not belong to the background.

Figure 6: The relationship between the number of discriminative frames and the performance. The distance ratio is the distance between the result and the standard background, divided by the distance between the standard background and the origin of the coordinate system.

Now, we measure the relationship between the rate of convergence and the number of discriminative frames. We compare each result with the standard background and calculate the distance between them. We then normalize each distance by the standard distance, defined as the distance between the standard background and the origin of the coordinate system. The result is shown in Fig.6.
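The normalized distance described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the distance is taken to be the Euclidean (Frobenius) norm, which this section does not specify, images are small grayscale arrays stored as nested lists, and the names `frobenius_norm` and `distance_ratio` as well as the pixel values are hypothetical.

```python
import math


def frobenius_norm(image):
    """Euclidean norm of an image given as a list of rows."""
    return math.sqrt(sum(v * v for row in image for v in row))


def distance_ratio(result, standard):
    """Distance between result and standard, divided by the distance
    between the standard background and the origin."""
    diff = [[r - s for r, s in zip(r_row, s_row)]
            for r_row, s_row in zip(result, standard)]
    return frobenius_norm(diff) / frobenius_norm(standard)


# Hypothetical 2x2 backgrounds: the standard one and one extracted result.
standard = [[100.0, 120.0], [80.0, 90.0]]
result = [[103.0, 116.0], [80.0, 90.0]]

ratio = distance_ratio(result, standard)
print(f"distance ratio = {ratio:.4f}")
```

A ratio of zero means the extracted background coincides with the standard one, and the normalization makes ratios comparable across sequences with different brightness or resolution.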

We can see from Fig.6 that the performance is not good enough when the number of frames is very small, and that it improves steadily as the number increases. When the number of discriminative frames reaches about twenty, the distance ratio starts to fluctuate. This fluctuation is natural. In our process, all weights in the equation are non-zero. When a new frame is added to the selected set as a discriminative frame, and the corresponding pixel is not in the foreground region, it has a negative influence on the performance. Although its weight becomes smaller and smaller in later iterations, the weight always remains positive. In contrast, if the pixel reflects the information of the background, the performance improves. The fluctuation can be removed by tightening the requirement on convergence accuracy, but only at the cost of an enormous growth in computational complexity. Next, we repeat these operations on some other sequences and show the results in Fig.7.

Figure 7: The results on four other sequences: 'Fountain' in the I2R dataset; 'ShoppingMall' in the I2R dataset; 'WavingTrees' in the Wallflower dataset; Sequence 2 of the sequences introduced in the paper .
Figure 8: The experiments on different video sequences. The results are shown together with the time consumption (in seconds). The sequences are selected from the I2R dataset and the Wallflower dataset. The experiment is divided into four periods: the first 300 frames, the first 600 frames, the first 900 frames, and the first 1200 frames.

From Fig.7 we can see that the performance is poor when the number of discriminative frames is small. As the number of frames increases, the distance ratio decreases. When the frame number exceeds twenty, the ratio tends to be stable: the solution of our model fluctuates around the optimal solution within the allowed error. Moreover, the distance ratio of the 'Fountain' sequence is nearly stable even when the frame number is quite small. This is due to the nature of the frame series, most of whose frames are already backgrounds, although polluted by some noise. The result on the 'WavingTrees' sequence is less satisfactory. In this sequence, the background keeps changing its shape. Once a new frame is added to the selected set, the result has to be adjusted to the new shape of the tree.

In this section, we studied the relationship between the performance and the size of the selected set (which equals the number of discriminative frames). We can recover the background accurately from about twenty frames, and even under unfavorable conditions twenty-five frames are enough. This agrees with our earlier claim that the result does not depend on the whole video: it is the few 'discriminative' frames that matter. Most frames in the video series bring nothing but repeated information. The content of the video also affects the performance. In some special cases, a single frame in the video may be exactly the background.

5.2 Experiments on large scale sequences

In practical applications, video resolution keeps increasing and high-definition cameras are now widely used, so the problem we face is becoming more difficult. Most traditional methods extract the background from sequences whose resolution is around 150 × 150 and whose number of frames is usually around 50. When the scale of the data increases, these methods tend to be slow. Although some pixel-wise methods can remain fast by reducing the size of the training set, their precision suffers considerably.

In this section, we check the ability of our model to solve large-scale problems. First, we focus on the number of frames in the sequence. We run our model on data consisting of more than 1200 frames. We extract the background in four periods, i.e., from the first 300 frames, the first 600 frames, the first 900 frames, and the first 1200 frames, respectively. The time and the results are shown in Fig.8.

Figure 9: The experiments on videos with much higher resolutions (time in seconds). The resolution is given at the top of each column. For each video sequence, we use the first 50 frames and the first 150 frames to extract the background, respectively.

When dealing with sequences that contain a large number of frames, our model extracts the background efficiently: hundreds of frames cost only dozens of seconds. As the number of frames increases, the precision of the extracted background improves. Temporary stay is a problem in most traditional background modeling methods: once a person stays in one place for a while, he may be considered part of the background in a short video. The experiment shows that, as the number of frames increases, the temporary stay is resolved. The experiment on the 'Hall' sequence illustrates this. The person is regarded as background in the first period (300 frames), and is removed from the background after he goes away.

The results also show that the time consumption is not linear in the number of frames. In fact, the properties of the video affect the speed. On the one hand, the uncertainty of the background caused by a temporary stay costs time. In the experiment on 'ForegroundAperture', the person lying on the table leaves at around the 500th frame. Processing the first 300 frames is quick because all these frames are almost the same. The other three periods spend much more time deciding whether this person belongs to the background. This is also true for the last period (1200 frames) of the experiment on the 'Lobby' sequence. On the other hand, the foreground and the noise also influence the time consumption. The waving trees in the 'Campus' sequence and the person in the 'Moved Object' sequence take some time to remove while extracting the background.

Second, we try our model on video sequences whose resolution is much higher than in the usual datasets. We use four video sequences, one of which is the 'ShoppingMall' sequence of the I2R dataset. The other three are intersection monitoring videos from public resources. We test our model on the first 50 frames and the first 150 frames of each sequence. The result is shown in Fig.9.

When the resolution increases from 320 × 256 to 720 × 576, the time consumption also increases sharply. In fact, each frame of the last video contains more than 5 times as many pixels as a frame of the first one. Our model spends about 100 seconds on the high-resolution video of 150 frames, while medium-resolution videos require only dozens of seconds. We can also conclude from Fig.9 that the content of the video influences the performance. In the first video, the temporary stay is identified as background at first, but is removed once the number of frames increases. In the third video, the distant cars move slowly in the fixed view due to perspective, which is effectively an approximate temporary stay. 150 frames are still not enough to remove this effect, and more frames are needed to solve such practical problems.
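As a quick check of the pixel-count statement, the ratio between the two frame sizes quoted in this paragraph can be worked out directly:

```latex
\frac{720 \times 576}{320 \times 256} = \frac{414720}{81920} = 5.0625
```

So the per-frame data volume grows by a factor of slightly more than five between the lowest and the highest resolution considered here.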

resolution: number      PCP        DECOLOR     SO-IR
                        109.72     217.37      9.08
                        231.47     569.89      9.98
                        370.57     1173.13     11.55
                        465.70     1389.76     17.63
Table 1: The time consumption of PCP, DECOLOR and SO-IR.

For comparison, we also examine the time consumption of PCA-based models, i.e., Principal Component Pursuit (PCP) EXYJ2011 and Detecting Contiguous Outliers in the Low-Rank Representation (DECOLOR) XCW2013, which perform well on small-scale problems. The results at different data scales (different resolutions and different frame numbers) are shown in Table 1.

Figure 10: From left to right: original image, background extracted by SO-IR, ground truth, and the foreground detected by SO-IR, PCP, MOG, and DECOLOR. From top to bottom: Camouflage (b00251, Wallflower), Curtain (Curtain22772, I2R), Hall (airport2180, I2R), ShoppingMall (ShoppingMall1980, I2R), WavingTrees (b00247, Wallflower), Escalator (airport4595, I2R), ForegroundAperture (b00489, Wallflower).

As shown in Table 1, PCP is faster than DECOLOR, because DECOLOR pays more attention to precision. But both are much slower than SO-IR. The advantage of SO-IR becomes more pronounced as the size of the data grows. When the frame number is 150, SO-IR is about 10 times faster than PCP, and it is about 30 times faster than PCP when the frame number increases to 450. Likewise, when the resolution increases from 160 × 120 to 320 × 256, SO-IR is also about 30 times faster than PCP. When the scale of the data is large, the major time consumption of SO-IR lies in exploring the 'discriminative' frames. Once we have these frames, we can model the background from them quickly and accurately.

The experiments in this section demonstrate the ability of our model to solve large-scale problems quickly. This is because we solve the problem from the explored 'discriminative' frames. The exploring process takes some time, but it removes the repeated and useless frames, and thus saves much more time in modeling the background.
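The speed-up factors discussed here can be recomputed from the timings in Table 1. The short Python sketch below does only that arithmetic. The rows are referred to by their position in the table, and the variable names are hypothetical.

```python
# Timings from Table 1, in table order: (PCP, DECOLOR, SO-IR).
table1 = [
    (109.72, 217.37, 9.08),
    (231.47, 569.89, 9.98),
    (370.57, 1173.13, 11.55),
    (465.70, 1389.76, 17.63),
]

# Speed-up of SO-IR over each baseline: baseline time divided by SO-IR time.
speedups = [(pcp / soir, decolor / soir) for pcp, decolor, soir in table1]

for row, (vs_pcp, vs_decolor) in enumerate(speedups, start=1):
    print(f"row {row}: {vs_pcp:.1f}x faster than PCP, "
          f"{vs_decolor:.1f}x faster than DECOLOR")
```

The ratios to PCP come out at roughly 12, 23, 32 and 26, in line with the rounded factors of about 10 and about 30 quoted in the text.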

5.3 Detecting the foreground

In this section, we detect the foreground region based on the calculated background. We also compare our performance with other methods, i.e., MOG CW1999, PCP EXYJ2011, and DECOLOR XCW2013. DECOLOR models the background by a low-rank matrix and models the foreground by an MRF.

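To quantitatively evaluate the performance of the different algorithms, we compute the precision and the recall as follows:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN},$$

where $TP$, $FP$, $TN$, $FN$ are the numbers of true positive, false positive, true negative, and false negative detections, respectively. The performance is good when the recall is high without sacrificing precision. We use the F-measure to check this:

$$\text{F-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$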

In the experiment, we use video sequences from the I2R dataset and the Wallflower dataset. In these sequences, hand-segmented foreground regions are provided for some frames. Thus, once we detect the foreground based on the extracted background, we compare the detected foreground region with the given ground truth and calculate the corresponding F-measure. The test frame is chosen randomly from the hand-segmented frames. To avoid the influence of temporary stays, we need enough frames to extract a reliable background. Here we use 250 frames, the last of which is the test frame, to form the data set for background extraction. The background is shared by all 250 frames, so we can detect the foreground region of each of them with it.
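The comparison between a detected foreground mask and the hand-segmented ground truth can be sketched as follows. This is a minimal Python illustration, assuming the usual definitions precision = TP/(TP+FP), recall = TP/(TP+FN) and F-measure = 2·precision·recall/(precision+recall). The function name `f_measure` and the tiny masks are hypothetical, and returning zero when a denominator vanishes is a convention chosen for this sketch.

```python
def f_measure(detected, truth):
    """Return (precision, recall, F-measure) for two binary masks.

    Masks are lists of rows; a truthy entry marks a foreground pixel.
    """
    tp = fp = fn = 0
    for d_row, t_row in zip(detected, truth):
        for d, t in zip(d_row, t_row):
            if d and t:
                tp += 1      # foreground detected and present in the ground truth
            elif d:
                fp += 1      # detected but not in the ground truth
            elif t:
                fn += 1      # in the ground truth but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Hypothetical 2x3 masks: detection result and hand-segmented ground truth.
detected = [[1, 1, 0],
            [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]

precision, recall, f = f_measure(detected, truth)
print(f"precision={precision:.3f} recall={recall:.3f} F={f:.3f}")
```

True negatives do not enter any of the three quantities, so large static backgrounds do not inflate the score.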

The sequences and the results are shown in Fig.10. In the experiments, SO-IR accurately extracts the background for almost all the sequences. The last video (g) is an exception: because the person stays there all the time, 250 frames are not enough to resolve the temporary stay in this video, although they are enough for the other six videos. SO-IR also performs well in foreground detection, which benefits from both the accurate background and the MRF model. DECOLOR performs well, as it also models the foreground by an MRF. In most sequences, the results of SO-IR are better than those of DECOLOR, because our extracted backgrounds are more accurate. The other two non-MRF methods detect the foreground region only roughly: they can find the border of the foreground region accurately, but are disturbed by complex noise. The corresponding F-measures for Fig.10 are presented next.

Sequence    SO-IR     PCP       MOG       DECOLOR
(a)         0.9737    0.6110    0.2047    0.5669
(b)         0.9020    0.7129    0.3841    0.8244
(c)         0.8452    0.6986    0.5406    0.7225
(d)         0.8314    0.5248    0.2498    0.6439
(e)         0.8170    0.6046    0.4014    0.8966
(f)         0.7972    0.5902    0.2455    0.6487
(g)         0.6382    0.5104    0.1962    0.3941
Table 2: Quantitative evaluation (F-measures) on the sequences shown in Fig.10.

Table 2 gives the F-measures of all the detected foreground regions in Fig.10. The results of SO-IR are better than those of the other three methods in six sequences, i.e., (a), (b), (c), (d), (f), (g), and slightly worse than DECOLOR in sequence (e). The performance of SO-IR also varies among videos. On the one hand, this is due to the result of background extraction, as is the case for sequence (g). On the other hand, the instability of the video's background also affects the performance of SO-IR. The waving trees in sequence (e) and the moving escalator in sequence (f) are both obstacles in the foreground detection process. In summary, the performance of SO-IR is competitive, and its advantage is clear even when the video is challenging.

6 Conclusion and future work

In this paper, we propose a Sparse Outliers Iterative Removal (SO-IR) algorithm to model the background of a video. We find that a few 'discriminative' frames are enough to model the background. We propose a sparse representation process to refine the original data set. Although exploring the 'discriminative' frames costs some time, it saves much more time in modeling the background. We also propose a cyclic iteration process to extract the background, which combines a tensor-wise PCA model and a pixel-wise outlier removal strategy. SO-IR achieves both high accuracy and high speed on real video sequences, and shows a particular advantage in large-scale problems.

In future work, we will deal with more complex problems in which the background is no longer stable across frames. Sparse representation will be further combined with the background modeling problem.


This work was partially supported by the National Natural Science Foundation of China (No. 51275348, No. 61379014, No. 6122210), and New Century Excellent Talents in University (Grant No. NCET-12-0399).



  • (1) C.Stauffer, W.Grimson, Adaptive background mixture models for real-time tracking, in:IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 1999.
  • (2) H.Li, C.Shen, Q.Shi, Real-time visual tracking using compressive sensing, in:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2011, 1305-1312.
  • (3) K.Zhang, L.Zhang, M.Yang, Real-time compressive tracking, in:European Conference on Computer Vision(ECCV), 2012, 864-877.
  • (4) T.Bouwmans, Recent advanced statistical background modeling for foreground detection-A systematic survey, Recent Patents on Computer Science 6(3)(2011) 147-176.
  • (5) A.Yilmaz, O.Javed, M.Shah, Object tracking: A survey, ACM Computing Surveys(CSUR) 38(4)(2006) 1-45.
  • (6) S.Li, H.Lu, L.Zhang, Arbitrary body segmentation in static images, Pattern Recognition 45(9)(2012) 3402-3413.
  • (7) D.Park, H.Byun, A unified approach to background adaptation and initialization in public scenes, Pattern Recognition 46(7)(2013) 1985-1997.
  • (8) M.Droogenbroeck, O.Paquot, Background subtraction: Experiments and improvements for vibe, in:IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshop(CVPRW), 2012, 32-37.
  • (9) V.Mahadevan, N.Vasconcelos, Background subtraction in highly dynamic scenes, in:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2008, 1-6.
  • (10) Y.Chen, C.Chen, C.Huang, Y.Hung, Efficient hierarchical method for background subtraction, Pattern Recognition 40 (10)(2007) 2706-2715.
  • (11) S.Cheung, C.Kamath, Robust techniques for background subtraction in urban traffic video, in:Proceedings of Video Communications and Image Processing(VCIP), 2004, 881-892.
  • (12) S.Wang, T.Su, S.Lai, Detecting moving objects from dynamic background with shadow removal, in:IEEE International Conference on Acoustics Speech and Signal Processing(ICASSP), 2011, 925-928.
  • (13) A.Ulges, T.Breuel, A local discriminative model for background subtraction, Pattern Recognition. Springer Berlin Heidelberg, 2008, 507-516.
  • (14) C.Wren, A.Azarbayejani, T.Darrell, A.Pentland, Pfinder:Real-time tracking of human body, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7)(1997) 780-785.
  • (15) P.KaewTraKuiPong, R.Bowden, An improved adaptive background mixture model for real-time tracking with shadow detection, Video-Based Surveillance Systems, 2002, 135-144.
  • (16) Z.Zivkovic, Improved adaptive Gaussian mixture model for background subtraction, in:Proceedings of the 17th International Conference on Pattern Recognition, 2004, 28-31.
  • (17) D.Lee, Effective Gaussian mixture learning for video background subtraction, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5)(2005) 827-832.
  • (18) K.Kim, T.Chalidabhongse, D.Harwood, L.Davis, Background modeling and subtraction by codebook construction, in:IEEE International Conference on Image Processing(ICIP), 2004, 3061-3064.
  • (19) K.Kim, T.Chalidabhongse, D.Harwood, L.Davis, Real-time foreground-background segmentation using codebook model, Real-time Imaging 11(3)(2005) 172-185.
  • (20) J.Guo, Y.Liu, C.Hsia, C.Hsu, Hierarchical method for foreground detection using codebook model, IEEE Transactions on Circuits and Systems for Video Technology 21(6)(2011) 804-815.
  • (21) A.Zaharescu, M.Jamieson, Multi-scale multi-feature codebook-based background subtraction, in:IEEE International Conference on Computer Vision workshops, 2011, 1753-1760.
  • (22) A.Hamad, N.Tsumura, Background subtraction based on time-series clustering and statistical modeling, Optical Review 19(2)(2012) 110-120.
  • (23) A.Elgammal, D.Harwood, L.Davis, Non-parametric model for background subtraction, in:Computer Vision ECCV 2000, Springer Berlin Heidelberg, 2000, 751-767.
  • (24) E.Learned-Miller, M.Narayana, A.Hanson, Background modeling using adaptive pixelwise kernel variances in a hybrid feature space, in:IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2012, 2104-2111.
  • (25) O.Barnich, M.Van Droogenbroeck, ViBe: A universal background subtraction algorithm for video sequences, IEEE Transactions on Image Processing 20(6)(2011) 1709-1724.
  • (26) L.Li, W.Huang, I.Gu, Q.Tian, Statistical modeling of complex backgrounds for foreground object detection, IEEE Transactions on Image Processing 13(11)(2004) 1459-1472.
  • (27) M.Heikkila, M.Pietikainen, A texture-based method for modeling the background and detecting moving objects, IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4)(2006) 657-662.
  • (28) L.Zhang, W.Dong, X.Wu, G.Shi, Spatial-Temporal color video reconstruction from noisy CFA sequence, IEEE Transactions on Circuits and Systems for Video Technology 20(6)(2010) 838-847.
  • (29) S.Liao, G.Zhao, V.Kellokumpu, M.Pietikainen, S.Li, Modeling pixel process with scale invariant local patterns for background subtraction in complex scenes, in:IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2010, 1301-1306.
  • (30) N.Oliver, B.Rosario, A.Pentland, A bayesian computer vision system for modeling human interactions, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8)(2000) 831-843.
  • (31) Z.Zhou, X.Li, J.Wright, E.Candes, Y.Ma, Stable principal component pursuit, in: IEEE International Symposium on Information Theory Proceedings(ISIT), 2010, 1518-1522.
  • (32) E.Candes, X.Li, Y.Ma, and J.Wright, Robust principal component analysis?, Journal of the ACM(JACM) 58(3)(2011) No.11.
  • (33) G.Tang, A.Nehorai, Robust principal component analysis based on low-rank and block-sparse matrix decomposition, in:IEEE Annual Conference on Information Sciences and Systems(CISS), 2011, 1-5.
  • (34) C.Guyon, T.Bouwmans, E.Zahzah, Foreground detection based on low-rank and block-sparse matrix decomposition, in:IEEE International Conference on Image Processing(ICIP), 2012, 1225-1228.
  • (35) X.Zhou, C.Yang, W.Yu, Moving object detection by detecting contiguous outliers in the low-Rank representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(3)(2013) 597-610.
  • (36) J.Wright, A.Yang, A.Ganesh, S.Sastry, Y.Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2)(2009) 210-227.
  • (37) W.Zuo, D.Meng, L.Zhang, X.Feng, D.Zhang, A generalized iterated shrinkage algorithm for non-convex sparse coding, in:IEEE International Conference on Computer Vision(ICCV), 2013.
  • (38) Z.Feng, M.Yang, L.Zhang, Y.Liu, D.Zhang, Joint discriminative dimensionality reduction and dictionary learning for face recognition, Pattern Recognition 46(8)(2013) 2134-2143.
  • (39) Y.Cong, J.Yuan, J.Liu, Sparse reconstruction cost for abnormal event detection, in:IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2011, 3449-3456.
  • (40) T.Guha, R.Ward, Learning sparse representations for human action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 34(8)(2012) 1576-1588.
  • (41) C.Zhao, X.Wang , W.Cham, Background subtraction via robust dictionary learning, EURASIP Journal on Image and Video Processing, 2011.
  • (42) E.Elhamifar, G.Sapiro, R.Vidal, See all by looking at a few: Sparse modeling for finding representative objects, in:IEEE Conference on Computer Vision and Pattern Recognition(CVPR), 2012, 1600-1607.
  • (43) B.Bader, T.Kolda, Algorithm 862: MATLAB tensor classes for fast algorithm prototyping, ACM Transactions on Mathematical Software(TOMS) 32(4)(2006) 635-653.
  • (44) T.Kolda, B.Bader, Tensor decompositions and applications, SIAM Review 51(3)(2009) 455-500.
  • (45) S.Geman, D.Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6)(1984) 721-741.
  • (46) S.Li, Markov random field modeling in image analysis, Springer Publishing Company, 2009.
  • (47) Y.Boykov, O.Veksler , R.Zabih, Fast approximate energy minimization via graph cuts, IEEE Transactions on Pattern Analysis and Machine Intelligence 23(11)(2001) 1222-1239.
  • (48) K.Toyama, J.Krumm, B.Brumitt, B.Meyers, Wallflower: Principles and practice of background maintenance, in:IEEE International Conference on Computer Vision(ICCV), 1999, 255-261.