1 Introduction
Background modeling of a video is a key part of many vision-based applications, such as real-time tracking CW1999 ; HCQ2011 ; KLM2012 , information retrieval and video surveillance T2011 ; AOM2006 . In a video, which consists of a series of frames, some parts of the scene stay almost constant, although polluted by noise SHL2012 . A model for extracting this invariant part is important: it helps us handle the video, especially in public scenes DH2013 . In some cases, background modeling is an essential step in the task of foreground detection MO2012 ; VN2008 . Once we extract the background, we can detect or even track the foreground, simply by comparing each incoming frame with the learned background model AOM2006 .
In the traditional background modeling problem, several factors make the background nonstationary, such as fluttering flags, waving leaves and rippling water YCCY2007 . Besides, some other issues SC2004 , like signal noise, sudden lighting changes and shadows STS2011 ; AT2008 , prevent us from distinguishing the background from the foreground. In addition, with the development of technology and the improvement of equipment, a new problem appears in many practical applications: the data of a practical background modeling problem becomes larger and larger. Thus the time consumption and the required memory become key issues for an effective algorithm.
A large number of background modeling methods have been reported in the literature over the past few decades. Most researchers regard the pixel series as the features and set up pixelwise models. One idea is to model each pixel series by a Gaussian distribution. Two pioneering works are the single Gaussian model in 1997 CATA1997 and the Mixture of Gaussians (MOG) in 1999 CW1999 . Based on these two works, some improved algorithms also achieved good performance PR2002 ; Z2004 ; D2005 in the following few years. Besides the Gaussian idea, clustering methods are also used to model the background, e.g. the codebook KTDL2004 ; KTDL2005 ; JYCC2011 ; AM2011 and time-series clustering AN2012 . Furthermore, a nonparametric method was proposed in 2000 ADL2000 and improved in 2012 EMA2012 , showing competitive performance. Recently, a new method (ViBe) was proposed in 2012 and improved afterwards, which performs better than mainstream pixelwise techniques MO2012 ; OM2011 . The mentioned methods solve the problem by setting up a model for each pixel and initializing all the models in a training process. Although higher accuracy can be obtained from sufficient training data, the speed is restricted by the size of the data, so there is a trade-off between precision and speed. Besides, it is also challenging to obtain 'good enough' training data. Another category deals with the background modeling problem at the region level. Some works pay attention to local regions, and different corresponding features have been proposed STS2011 ; LWIQ2004 ; MM2006 ; LWXG2010 ; SGV2010 . Furthermore, global region based algorithms have achieved better performance than the others. In 2000, Oliver et al. NBA2000
first modeled the background by Principal Component Analysis (PCA). This method models the background by projecting the high-dimensional data onto a lower-dimensional subspace. Recently, Robust PCA (RPCA) in 2010 ZXJEY2010 and Principal Component Pursuit (PCP) in 2011 EXYJ2011 have shown their superiority over the original PCA. Hence some heuristic methods were proposed GA2011 ; CTE2012 ; XCW2013 in the following few years. These models get rid of the training process and can use all the information contained in arbitrarily given data. However, solving these models requires the Singular Value Decomposition (SVD) as an essential step, and it is well known that the SVD is very time consuming. The speed and the required memory are therefore sensitive to the scale of the data, and these models are of limited use in large-scale problems.
In this paper, we propose a Sparse Outliers Iterative Removal (SOIR) algorithm. SOIR meets the demands of large-scale problems and achieves high accuracy and high speed simultaneously. To handle large-scale problems, we introduce sparse representation (or sparse coding) into the work of modeling the background: we explore a much smaller set of 'discriminative' frames that still carry enough information to model the background. Besides, we propose a cyclic iteration, composed of a tensorwise PCA model and a pixelwise outlier removal strategy, to extract the background from the 'discriminative' frames. The two parts form a whole, which we call the Sparse Outliers Iterative Removal (SOIR) algorithm; its framework is shown in Fig.1. In addition, we detect the foreground objects by a Markov Random Field (MRF). Experiments show that SOIR outperforms most mainstream algorithms both in speed and in precision, especially on large-scale problems. The main contributions can be summarized as follows:

We utilize sparse representation to refine the frames of the video. In background modeling, too many redundant frames usually deteriorate the performance. By working on the selected frame set, which is composed of the discriminative frames, SOIR can extract the background quickly and exactly. Thanks to this, our model can solve large-scale problems efficiently, which is significant in practical background modeling problems.

The tensorwise model in the cyclic iteration is in fact a PCA model. Different from previous works, the simplified background matrix of a static background problem is explicitly rank-1, instead of just being low-rank. To enforce this constraint, we propose a new space in which the background actually lies. We solve the tensor model by modifying the traditional alternating direction method of multipliers.

We give a cyclic iteration composed of a tensorwise model and a pixelwise outlier removal strategy. In general, a tensorwise process considers the overall information and is usually much faster, while a pixelwise process pays more attention to individual information and is usually more accurate. Our cyclic iteration makes full use of their respective advantages, and is thus both fast and accurate.
2 Related work
In our algorithm, the sparse representation model and the principal component analysis model play the key roles: the sparse representation process and the cyclic outlier removal strategy benefit a lot from these two models. We review the related work on them here. Besides, since we work with the natural tensor structure of the video and its frames, we also give some notation on tensors.
2.1 Sparse representation
In recent years, sparse representation (or sparse coding) has become a research focus JAASY2009 ; WDLXD2013 , and it is a powerful tool for clarifying the structure of data. With the items (or signal atoms) from an overcomplete dictionary, we can represent the dictionary and new inputs by a linear combination of the items ZMLYD2013 . Researchers follow this idea and explore the sparse structure of some practical problems, like abnormal event detection YJJ2011 and human action recognition TR2012 , which are in fact background extraction and foreground detection problems. In CXW2011 , the authors assume that the backgrounds are the atoms of the dictionary, while considering the foreground and the noise as pollution. In 2012, Ehsan et al. EGR2012 regarded each single frame as an atom and sought the 'special' frames in a video with the help of sparse representation.
We follow the work of Ehsan et al., because their assumption is more reasonable for problems with complicated foreground objects. We set up a sparse representation model for the video and explore the 'discriminative' frames, from which we can extract the background exactly.
2.2 Principal component pursuit
Principal Component Analysis (PCA) is one of the most popular ways to find a low-dimensional subspace. PCA solves the following optimization problem EXYJ2011 :
(1) $\quad \min_{L}\ \|D-L\| \quad \text{s.t.} \quad \operatorname{rank}(L)\le k$

where $D$ denotes the given data matrix, $\|\cdot\|$ is the matrix spectral norm, and $k$ bounds the rank of the estimate $L$. A number of natural approaches to robustifying PCA have been explored and proposed in the literature over several decades. Unfortunately, no fully satisfying result was achieved.
Recently, Candès et al. EXYJ2011 proved that one can exactly recover the low-rank matrix as well as the sparse error matrix under mild conditions. The model, known as Principal Component Pursuit (PCP), is as follows:
(2) $\quad \min_{L,S}\ \|L\|_{*} + \lambda\|S\|_{1} \quad \text{s.t.} \quad L+S=D$

where $\lambda$ is an appropriate weighting parameter, and $\|\cdot\|_{*}$ and $\|\cdot\|_{1}$ denote the nuclear norm (sum of singular values) and the 1-norm (sum of the absolute values of the matrix elements), respectively.
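As a rough illustration of PCP, the following sketch (not the paper's implementation) solves Model (2) by the well-known inexact augmented Lagrangian scheme, alternating singular value thresholding for $L$ with elementwise soft-thresholding for $S$; parameter choices such as $\lambda = 1/\sqrt{\max(m,n)}$ follow common practice and are assumptions here:

```python
import numpy as np

def pcp(D, n_iter=200):
    """Sketch of Principal Component Pursuit (Eq. (2)) via inexact ALM:
    alternate singular value thresholding (L) and soft-thresholding (S)."""
    m, n = D.shape
    lam = 1.0 / np.sqrt(max(m, n))        # conventional weighting parameter
    mu = 1.25 / np.linalg.norm(D, 2)      # penalty, grown each iteration
    Y = np.zeros_like(D)                  # Lagrange multiplier
    S = np.zeros_like(D)
    for _ in range(n_iter):
        # L-update: singular value thresholding of D - S + Y/mu
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # S-update: elementwise soft-thresholding
        R = D - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Y += mu * (D - L - S)             # dual ascent on the residual
        mu = min(mu * 1.5, 1e7)           # inexact ALM schedule
    return L, S
```

Note that every iteration performs a full SVD, which is exactly the cost that makes such matrix models expensive on large-scale video data, as discussed above.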
2.3 Tensors theory
A tensor is a multidimensional array. More formally, an $N$-way or $N$th-order tensor is an element of the tensor product of $N$ vector spaces, each of which has its own coordinate system BT2006 ; TB2009 . Intuitively, a vector is a first-order tensor and a matrix is a second-order tensor. In this paper, we denote vectors by lowercase letters, e.g. $a$; matrices by uppercase letters, e.g. $A$; and higher-order tensors by calligraphic letters, e.g. $\mathcal{A}$. The space of all $N$th-order tensors of size $I_1\times I_2\times\cdots\times I_N$ is denoted by $\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$. There are several rank definitions for an $N$th-order tensor $\mathcal{A}$. Here we use the n-rank (or multilinear rank), which is based on the unfoldings of a tensor TB2009 . The n-rank of $\mathcal{A}$ is the set of ranks of its different unfoldings:

(3) $\quad \operatorname{rank}_n(\mathcal{A}) = \big(\operatorname{rank}(A_{(1)}),\ \operatorname{rank}(A_{(2)}),\ \dots,\ \operatorname{rank}(A_{(N)})\big)$
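As a concrete illustration of the n-rank, the following sketch (function names are ours, not from the paper) computes the ranks of all unfoldings of a rank-1 third-order tensor with NumPy:

```python
import numpy as np

def unfold(X, mode):
    """Mode-n unfolding: the chosen mode becomes the rows, the remaining
    modes are flattened into the columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def n_rank(X):
    """n-rank (Eq. (3)): tuple of ranks of all mode-n unfoldings."""
    return tuple(np.linalg.matrix_rank(unfold(X, m)) for m in range(X.ndim))

# a rank-1 third-order tensor: outer product of three vectors
a, b, c = np.arange(1, 4.0), np.arange(1, 5.0), np.arange(1, 6.0)
X = np.einsum('i,j,k->ijk', a, b, c)
print(n_rank(X))   # every unfolding of a rank-1 outer product has rank 1
```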
Tensors can be multiplied together BT2006 . Let $\mathcal{A}$ be of size $I_1\times\cdots\times I_M\times J_1\times\cdots\times J_P$ and $\mathcal{B}$ be of size $I_1\times\cdots\times I_M\times K_1\times\cdots\times K_Q$. We can multiply the two tensors along their first $M$ modes, and the result is a tensor $\mathcal{C}$ of size $J_1\times\cdots\times J_P\times K_1\times\cdots\times K_Q$, given by

(4) $\quad \mathcal{C}(j_1,\dots,j_P,k_1,\dots,k_Q) = \sum_{i_1=1}^{I_1}\cdots\sum_{i_M=1}^{I_M} \mathcal{A}(i_1,\dots,i_M,j_1,\dots,j_P)\,\mathcal{B}(i_1,\dots,i_M,k_1,\dots,k_Q)$

where $1\le j_p\le J_p$ ($p=1,\dots,P$); $1\le k_q\le K_q$ ($q=1,\dots,Q$) and $1\le i_m\le I_m$ ($m=1,\dots,M$).
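The contracted product of Equation (4) can be written compactly with einsum; here two tensors sharing their first two modes are contracted along them (the shapes are illustrative, not from the paper):

```python
import numpy as np

# Contract two tensors along their shared first two modes:
# A has shape (I1, I2, J), B has shape (I1, I2, K);
# the result has shape (J, K), summing over i1 and i2 as in Eq. (4).
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4, 5))
B = rng.normal(size=(3, 4, 6))
C = np.einsum('ijp,ijq->pq', A, B)
assert C.shape == (5, 6)
```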
3 Sparse outliers iterative removal algorithm
In this section, we focus on modeling the background of a video, and will give the details of foreground detection in Section 4.
We use $\mathcal{V}$ to denote a video. A video contains a series of color frames; assume that there are $N$ frames. Each frame is a third-order tensor by nature, and the $i$th frame of the video is denoted by $\mathcal{V}_i \in \mathbb{R}^{m\times n\times 3}$. Then $\mathcal{V} \in \mathbb{R}^{m\times n\times 3\times N}$.
We first analyse the components of a video. In real frame series, the background is partly covered by the foreground objects. Denote the foreground region of the $i$th frame by $\Omega_i$, and the region outside it by $\Omega_i^c$. Let $\mathcal{P}_{\Omega}$ be the orthogonal projector onto the span of tensors vanishing outside of $\Omega$: the $(x,y,z)$th component of $\mathcal{P}_{\Omega}(\mathcal{X})$ is equal to $\mathcal{X}_{xyz}$ if $(x,y,z)\in\Omega$ and zero otherwise. Thus the components of the video can be expressed as:

(5) $\quad \mathcal{V}_i = \mathcal{P}_{\Omega_i^c}(\mathcal{B}_i) + \mathcal{P}_{\Omega_i}(\mathcal{F}_i)$

where $\mathcal{B}_i$ and $\mathcal{F}_i$ denote the background and the foreground of the $i$th frame, respectively. Actually, $\mathcal{P}_{\Omega_i}(\mathcal{F}_i)=\mathcal{F}_i$, because $\Omega_i$ is exactly the foreground region. Besides the two mentioned parts, noise is also an essential component of the video, i.e.

(6) $\quad \mathcal{V}_i = \mathcal{P}_{\Omega_i^c}(\mathcal{B}_i) + \mathcal{P}_{\Omega_i}(\mathcal{F}_i) + \mathcal{N}_i$

where $\mathcal{N}_i$ is the noise. Equation (6) gives the actual components of a video, and it will serve as a strict constraint in our model.
3.1 'Discriminative' frame exploration by sparse representation
In most background modeling problems, the frame series is much too redundant for the task of background modeling. A few particular frames usually carry enough information about the background, while too many repeated frames just hinder the work. In a PCA model, the data are projected onto a lower-dimensional space, which stands for the static information among the frames; other information is ignored. If the foreground objects are unchanged or change slowly, they are more likely to be regarded as static. In this section, we refine the frame sequence and get a new set composed of the selected 'discriminative' frames.
A video contains a large number of frames. Some of them can be represented by a linear combination of the rest, and others are approximately repeated. In other words, the set can be sparsely represented by itself, so we set the video itself as the original dictionary. For simplicity, we transform each frame in the tensor into a gray one; the information retained in the gray frames is enough for our work. Denote the simplified (gray) frame set by $X$. The sparse representation of the set is as follows:

(7) $\quad \min_{C}\ \tfrac{1}{2}\|X - XC\|_F^2 + \lambda \sum_i \|c^i\|_2$

where $C$ is the coefficient matrix whose $i$th row is $c^i$, and the row-sparsity term encourages whole rows of $C$ to be zero. Model (7) is a standard sparse representation problem and is easy to solve.

In Model (7), multiplied by $C$, the set $X$ can represent itself, and the same holds for the original set $\mathcal{V}$. Just by counting the number of nonzero rows in $C$, we can deduce the role of each frame in representing the whole set. All the useful frames are picked out to form a low-level refined frame set $\mathcal{D}$.
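A toy illustration of this row-counting step (the coefficient matrix below is made up for illustration): frames whose rows of $C$ are numerically nonzero are kept for the refined set:

```python
import numpy as np

# Toy coefficient matrix C from Model (7): row i holds frame i's
# contribution to representing the whole set. Frames with (numerically)
# nonzero rows are kept as the refined set D.
C = np.array([[0.9, 0.0, 0.8],
              [0.0, 0.0, 0.0],      # frame 1 is never used
              [1e-9, 0.0, 0.0],     # negligible contribution
              [0.1, 1.0, 0.2]])
keep = np.where(np.linalg.norm(C, axis=1) > 1e-6)[0]
print(keep)   # frames 0 and 3 form the refined set
```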
The results of the above steps are determined by the properties and the size of the frame series. However, as will be seen in our experiments, we select only a few discriminative frames and still extract the background successfully. This is because, in most cases, even a small change in the content of the video breaks the linear relationship between frames. To further reduce the size of the set $\mathcal{D}$, we could dynamically adjust the parameter $\lambda$ in Model (7) for each specific problem; since that is complicated, we fix the parameter for convenience and obtain the selected set by the following operation.
We consider the space of all frames, in which each frame is an element, and choose the Euclidean distance as the metric, making it a Euclidean space. We judge whether a frame is 'discriminative' enough by its total distance to all the other frames:

(8) $\quad d(i) = \sum_{j\neq i} \|X_i - X_j\|$

Thus we select the top few 'discriminative' frames with the help of (8) and form the selected set $\mathcal{S}$. We claim that the result of the background extraction does not depend on the whole frame series; it is the few 'discriminative' frames that do the work.
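A minimal sketch of the selection rule (8), with a hypothetical helper name: rank frames by their total Euclidean distance to the others and keep the top $k$:

```python
import numpy as np

def discriminative_frames(frames, k):
    """Rank frames by total Euclidean distance to the other frames
    (Eq. (8)) and keep the k most 'discriminative' ones."""
    n = len(frames)
    flat = np.asarray(frames, float).reshape(n, -1)
    # d[i] = sum over j of ||x_i - x_j||
    d = np.array([np.linalg.norm(flat - flat[i], axis=1).sum()
                  for i in range(n)])
    return np.argsort(d)[::-1][:k]   # indices of the top-k frames
```

With near-duplicate frames and a couple of distinct ones, the distinct frames receive the largest total distances and are selected first.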
Eventually, we get $k$ frames: $\mathcal{S} = \{\mathcal{S}_1,\dots,\mathcal{S}_k\}$. They are elements selected from the low-level refined frame set $\mathcal{D}$, and thus from the original set. The set $\mathcal{S}$ is much smaller than the original one. In our experiments, we find that 20 to 30 frames are already enough to model the background in most videos, which may consist of dozens or even hundreds of frames.
Fig.2 illustrates this process of our algorithm. We are not the first to introduce sparse representation and dictionary learning into background modeling. However, different from previous works CXW2011 ; YJJ2011 , we assume that the discriminative frames are the atoms of the dictionary, which is a more reasonable assumption in practical problems. Besides, we use this process to refine the video, instead of modeling the background directly. The strategy of replacing the original data by the dictionary is effective, and it is a reasonable dimension reduction method for large-scale problems. Thanks to this process, the efficiency is improved greatly and the memory requirement is reduced considerably. In the next section, we continue our work on the selected 'discriminative' set $\mathcal{S}$, instead of the original frame set.
3.2 Background extraction by cyclic iteration process
Our task in this section is to extract the background from the selected frame set. Thanks to the process in Section 3.1, the foreground objects in different selected frames are distinct. The task can thus be formulated as Equation (9): the value of pixel $(x,y)$ in the background is a linear weighted sum of the corresponding values in all the selected frames,

(9) $\quad \mathcal{B}(x,y) = \sum_{i=1}^{k} w_i^{(x,y)}\, \mathcal{S}_i(x,y) + \epsilon$

where $\{w_i^{(x,y)}\}$ is the weight series for calculating the value of pixel $(x,y)$, and $\epsilon$ reflects the error, which is sufficiently close to zero. Different pixels require different weight series.
Fig.3 gives an example: the different values of pixel $(x,y)$ in different frames are represented by colorful solid points. In some frames, the background is polluted by noise or covered by a foreground object, so the values are far from the ground truth; we call them the outliers. Obviously, their weights should be extremely small. On the other hand, the inliers lie around the truth in most frames, and their corresponding weights should be updated to large values.
Here we propose a cyclic iteration process that combines the pixelwise and the tensorwise ideas. We use a tensor model to calculate the purified mean of all the frames, and based on the value of each pixel in the purified-mean frame, we update the values of all the frames by a pixelwise outlier removal strategy.
3.2.1 Pixelwise strategy
As shown in Fig.3, the values of a pixel in different frames are scattered around the truth. We calculate their purified mean, which is close to the mean of them, and use it to approximate the true value in Equation (9); its calculation will be introduced in the next section.

The purified-mean value may still be a little away from the ground truth, but it is much better than the worst outlier, so we replace the worst value with the purified-mean value.

The replacing process is shown in Fig.4, which depicts the outlier removal for pixel $(x,y)$ in the $t$th iteration. First, we take the $k$ frames obtained after the $(t-1)$th iteration and pick out the values of pixel $(x,y)$ in these frames, i.e. the colorful solid points in Fig.4. Second, we calculate the purified mean of the $k$ frames, which gives the value of pixel $(x,y)$ in the purified-mean frame, i.e. the black solid point. Third, we find the value farthest from the purified-mean value and replace it with the purified-mean value. At last, we obtain $k$ new values; in fact, only one of them differs from the original values.

The above process is for a single pixel. In the $t$th iteration, we repeat it for all pixels. Thus we get $k$ new frames after the $t$th iteration, and their purified mean must be closer to the ground truth than in the last iteration. The iteration continues until the value of every pixel in the purified-mean frame converges to a constant.
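The per-pixel iteration can be sketched as follows; for illustration we use the plain mean as a stand-in for the purified mean that the tensorwise model of Section 3.2.2 actually computes:

```python
import numpy as np

def remove_outliers(values, n_iter=50, tol=1e-6):
    """Pixelwise sketch: repeatedly replace the value farthest from the
    current mean (the worst outlier) with that mean. The plain mean is a
    stand-in for the purified mean of the tensorwise model."""
    v = np.asarray(values, float).copy()
    for _ in range(n_iter):
        m = v.mean()                      # stand-in for the purified mean
        worst = np.argmax(np.abs(v - m))  # most distant value
        if abs(v[worst] - m) < tol:       # interval has collapsed
            break
        v[worst] = m                      # replace the worst outlier
    return v.mean()

# inliers around 10 plus two outliers (noise / foreground occlusion)
print(remove_outliers([10.1, 9.9, 10.0, 10.2, 9.8, 40.0, -20.0]))
```

After a few iterations the two outliers are absorbed and the estimate settles near the inlier cluster around 10.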
The outlier removal process is in fact the weight-updating process of Formula (9): once we replace the worst outlier with the purified mean, the corresponding weight becomes smaller than in the last iteration. But there is no need to calculate the actual weights, since what we care about is the accumulated result.
3.2.2 Tensorwise PCA model
In this section, we set up a tensor model to calculate the purified mean of the selected frames. It starts as the mean of the frames, but moves slightly away during denoising. We solve the model by a modified Alternating Direction Method of Multipliers (ADM), and the solution is constrained to lie in a space $\mathbb{S}$.
The background is unchanged in our problem; it should be the same in all frames, i.e. $\mathcal{B}_1 = \mathcal{B}_2 = \cdots = \mathcal{B}_k$, or the following:

(10) $\quad \boldsymbol{\mathcal{B}} = \bar{\mathcal{B}} \circ e$

where $e$ is a first-order tensor (a vector of ones) and $\bar{\mathcal{B}}$ is the common background frame.
Constraint (10) in fact insists that the background should be 'rank-1'. First, we simplify the tensor by combining its mode-1 and mode-2 into one single mode, just like the vectorization process in previous works CXW2011 ; CTE2012 ; EGR2012 , so that each $m\times n\times 3$ frame becomes an $mn\times 3$ slice and the simplified background is a third-order tensor. Then we give Lemma 1.
Lemma 1.

In a static background problem, the n-rank of the simplified background tensor $\boldsymbol{\mathcal{B}}$ is $\operatorname{rank}_n(\boldsymbol{\mathcal{B}}) = (3, 3, 1)$, if and only if the slices (of the tensor) in different color channels are not linearly correlated.

The ranks of the first two unfoldings are both 3; this is caused by the different color channels not being linearly correlated. In fact, $\operatorname{rank}(B_{(3)}) = 1$ is the most important conclusion: if the frames we deal with are gray, it becomes exactly a rank-1 constraint.
To handle Constraint (10), we consider a subspace of $\mathbb{R}^{m\times n\times 3\times k}$, denoted by $\mathbb{S}$, consisting of all fourth-order tensors whose frontal slices are identical (i.e. $\mathcal{X}_1 = \mathcal{X}_2 = \cdots = \mathcal{X}_k$). Obviously, $\mathbb{S}$ is convex, and it is easy to find the solution of a problem restricted to this space.
Lemma 2.

Given a tensor $\mathcal{A} \in \mathbb{R}^{m\times n\times 3\times N}$, the solution to the optimization problem

(11) $\quad \min_{\mathcal{X}\in\mathbb{S}}\ \|\mathcal{X} - \mathcal{A}\|_F$

is $\mathcal{X}_1 = \cdots = \mathcal{X}_N = \frac{1}{N}\sum_{i=1}^{N}\mathcal{A}_i$. Actually, it is the average of the $N$ frames.
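By Lemma 2, projecting onto $\mathbb{S}$ amounts to replacing every frame with the frame average; a quick NumPy sketch (the function name is ours):

```python
import numpy as np

def project_to_S(A):
    """Project a 4th-order tensor of shape (h, w, c, N) onto the space S
    of tensors whose N frontal slices are identical: per Lemma 2, this
    replicates the mean frame N times."""
    mean_frame = A.mean(axis=3, keepdims=True)
    return np.repeat(mean_frame, A.shape[3], axis=3)

A = np.random.default_rng(0).normal(size=(4, 4, 3, 5))
P = project_to_S(A)
# P is the closest element of S to A in the Frobenius norm;
# any other element of S (e.g. the first frame replicated) is farther.
```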
All the former derivations in this section are for the whole frame series; they are also true for the frames in the selected set $\mathcal{S}$. As illustrated before, what we want is the background part, or the static content, so we minimize the changing part to group as much information as possible into the background. Besides, we take the strict Constraint (6) into account and give our model:

(12) $\quad \min_{\mathcal{B}\in\mathbb{S},\,\mathcal{F},\,\mathcal{N}}\ \|\mathcal{P}_{\Omega}(\mathcal{F}) + \mathcal{P}_{\Omega^c}(\mathcal{N})\|_1 \quad \text{s.t.} \quad \mathcal{S} = \mathcal{P}_{\Omega^c}(\mathcal{B}) + \mathcal{P}_{\Omega}(\mathcal{F}) + \mathcal{N}$

where $\|\cdot\|_1$ is the $\ell_1$ norm, which equals the sum of the absolute values of all the nonzero elements in the set. In the foreground region, we minimize the differing elements in $\mathcal{F}$; outside the foreground region, a pixel is a mixture of the background and the noise, so we just minimize the noise $\mathcal{N}$. Thanks to the property of this norm and the disjoint supports of the two regions, we can arrange the two regions into one single formula and obtain the objective function in Model (12).
Then we solve Model (12) by a modified ADM method. We first rearrange the model. The model extracts the background, i.e. the unchanged part among the frames; the changing content, either the foreground objects or the noise, is nothing but pollution on the background. From this viewpoint, we denote all the non-background parts by $\mathcal{E}$. Then Model (12) is transformed into the following form:

(13) $\quad \min_{\mathcal{B}\in\mathbb{S},\,\mathcal{E}}\ \|\mathcal{E}\|_1 \quad \text{s.t.} \quad \mathcal{S} = \mathcal{B} + \mathcal{E}$

This model is much simpler than Model (12). It is also a PCA model, and the solution lies in the space $\mathbb{S}$.
In Model (13) the constraint $\mathcal{B}\in\mathbb{S}$ cannot be transformed into a single-variable linear equation. What is more, this constraint is strict and cannot be relaxed, so we will use it as a correction term. Now, we consider the model:

(14) $\quad \min_{\mathcal{B},\,\mathcal{E}}\ \|\mathcal{E}\|_1 \quad \text{s.t.} \quad \mathcal{S} = \mathcal{B} + \mathcal{E}$
Obviously, the variables here are tensors instead of matrices. We can still follow the idea of ADM, which uses a multiplier to form the augmented Lagrangian, i.e.

(15) $\quad L(\mathcal{B},\mathcal{E},\mathcal{Y}) = \|\mathcal{E}\|_1 + \langle\mathcal{Y},\ \mathcal{S}-\mathcal{B}-\mathcal{E}\rangle + \tfrac{\mu}{2}\|\mathcal{S}-\mathcal{B}-\mathcal{E}\|_F^2$

where $\mathcal{Y}$ is the Lagrange multiplier. Then we get the iteration

(16) $\quad \mathcal{E}^{t+1} = \mathcal{T}_{1/\mu}\big(\mathcal{S}-\mathcal{B}^{t}+\mathcal{Y}^{t}/\mu\big)$, $\quad \mathcal{B}^{t+1} = \arg\min_{\mathcal{B}}\ \|\mathcal{S}-\mathcal{B}-\mathcal{E}^{t+1}+\mathcal{Y}^{t}/\mu\|_F^2$, $\quad \mathcal{Y}^{t+1} = \mathcal{Y}^{t} + \mu\,(\mathcal{S}-\mathcal{B}^{t+1}-\mathcal{E}^{t+1})$

where $\mu$ is in fact a step-length parameter and $\mathcal{T}_{\tau}$ is a soft-threshold operator for tensors. We get the iteration for $\mathcal{E}$ by utilizing the following lemma:
Lemma 3.

Given a tensor $\mathcal{A}$, the solution to the optimization problem

(17) $\quad \min_{\mathcal{X}}\ \tau\|\mathcal{X}\|_1 + \tfrac{1}{2}\|\mathcal{X}-\mathcal{A}\|_F^2$

is $\mathcal{X} = \mathcal{T}_{\tau}(\mathcal{A})$, where

(18) $\quad \mathcal{T}_{\tau}(\mathcal{A}) = \begin{cases} \mathcal{A}-\tau, & \text{on entries greater than } \tau \\ \mathcal{A}+\tau, & \text{on entries less than } -\tau \\ 0, & \text{otherwise} \end{cases}$

Here a tensor plus (or minus) a single number means that every element of it is increased (or decreased) by this number.
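Lemma 3 is the familiar elementwise soft-thresholding (shrinkage) rule; a minimal NumPy sketch:

```python
import numpy as np

def soft_threshold(A, tau):
    """Elementwise soft-threshold operator T_tau for tensors (Eq. (18)):
    shrink every entry toward zero by tau, zeroing entries within
    [-tau, tau]."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

A = np.array([[-2.0, -0.3], [0.5, 3.0]])
soft_threshold(A, 1.0)   # -> [[-1.0, 0.0], [0.0, 2.0]]
```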
As we have illustrated before, the tensor $\mathcal{B}$ must lie in the space $\mathbb{S}$. Once we get a new $\mathcal{B}$ in one iteration, we project it onto $\mathbb{S}$; that is, we replace it with its orthogonal projection $P_{\mathbb{S}}(\mathcal{B})$ (by Lemma 2, the average of its frames) and continue the operation. In other words, we meet the constraint $\mathcal{B}\in\mathbb{S}$ by using $P_{\mathbb{S}}(\mathcal{B})$ in place of $\mathcal{B}$. Then the iteration is transformed into the following form:

(19) $\quad \mathcal{E}^{t+1} = \mathcal{T}_{1/\mu}\big(\mathcal{S}-\mathcal{B}^{t}+\mathcal{Y}^{t}/\mu\big)$, $\quad \mathcal{B}^{t+1} = P_{\mathbb{S}}\big(\mathcal{S}-\mathcal{E}^{t+1}+\mathcal{Y}^{t}/\mu\big)$, $\quad \mathcal{Y}^{t+1} = \mathcal{Y}^{t} + \mu\,(\mathcal{S}-\mathcal{B}^{t+1}-\mathcal{E}^{t+1})$

The process above follows the idea of ADM; however, we apply it to a tensor model, by designing the soft-threshold operator for tensors and utilizing some tensor theory.
3.3 Algorithm formulation and convergence analysis
Here, we give our Sparse Outliers Iterative Removal Algorithm (SOIR). Its convergence condition is that the purified-mean frame stops changing, i.e. every pixel of the purified-mean frame converges to a constant.
Algorithm 1 Sparse Outliers Iterative Removal (SOIR)

Input: the original frame set $\mathcal{V}$.
Output: the background $\mathcal{B}$.
1: Transform the frames to gray, solve the sparse representation Model (7), and count the nonzero rows of $C$ to obtain the refined set $\mathcal{D}$;
2: Select the top discriminative frames by (8) to form the selected set $\mathcal{S}$;
3: while not converged:
4:   compute the purified mean of the current frames by the tensorwise iteration (19);
5:   for each pixel, replace the value farthest from the purified mean with the purified-mean value;
6: end while.
As to Algorithm 1, the sparse representation of the frames is the key part for guaranteeing that enough information is carried by the selected frames. The cyclic iteration process is the main part: in it, we calculate the purified mean of the selected frames and use it to update the frames by replacing the outliers pixelwise.
Now we discuss the convergence of SOIR. We prove that, for an arbitrary pixel $(x,y)$, the cyclic iteration process returns a solution; the conclusion then also holds for the whole tensor.

There are $k$ frames in the selected set $\mathcal{S}$, so for pixel $(x,y)$ there are $k$ values, as shown in Fig.3. Assume the minimum is $m_0$ and the maximum is $M_0$; then all the values belong to the interval $[m_0, M_0]$. Record the minimum and the maximum of the $k$ values in the $t$th iteration as $m_t$ and $M_t$. The purified-mean value must lie between $m_t$ and $M_t$. Thus if the interval converges to one point, the purified-mean series must converge to the same point, which is the solution of our algorithm. In other words, we have to prove that $\lim_{t\to\infty}(M_t - m_t) = 0$.

First, unless $M_t - m_t$ is already small, the purified-mean value lies strictly inside the interval $[m_t, M_t]$. This can be inferred from the PCA model: the purified mean stays around the mean of all the values, and the mean is strictly inside the interval unless the values in the $t$th iteration are already close to each other, i.e. they have converged to a point. Second, $m_t \le m_{t+1} \le M_{t+1} \le M_t$, because in each iteration the worst outlier is replaced by the purified-mean value, which belongs to $[m_t, M_t]$; thus after each iteration the maximum and the minimum can only move closer. Finally, as the number of iterations increases, the interval $[m_t, M_t]$ gets smaller and smaller, so there must be a constant $c$ with $\lim_{t\to\infty} m_t = \lim_{t\to\infty} M_t = c$.
We must point out that the solution may not be the ground truth. As mentioned above, the process is influenced by the properties of the video. The experiments in Section 5 will show that the solution is quite close to the ground truth as long as the video is not too degraded.
4 Foreground region detection
As illustrated before, a frame in the video is composed of three parts: the background, the foreground objects and various kinds of noise. In Section 3, we efficiently computed the background tensor; now our task is to detect the foreground region.
4.1 Background subtraction
Background subtraction is a common way to detect the foreground region, and it is the first step of our foreground detection.
Constraint (6) gives the formulation of the frame series in our model. It also holds for each single frame $\mathcal{V}$, i.e.

(20) $\quad \mathcal{V} = \mathcal{P}_{\Omega^c}(\mathcal{B}) + \mathcal{P}_{\Omega}(\mathcal{F}) + \mathcal{N}$

The result obtained in Section 3 is for the whole video; we denote the background of this frame by $\mathcal{B}$. Then we can compute the subtraction result, denoted by $\mathcal{R}$:

(21) $\quad \mathcal{R} = \mathcal{V} - \mathcal{B}$

We find that the residual background only exists in the foreground region, and outside this region there is nothing but noise, i.e.

(22) $\quad \mathcal{R} = \mathcal{P}_{\Omega}(\mathcal{F} - \mathcal{B}) + \mathcal{N}$

From Expression (22) we can see the essence of background subtraction: the result depends on the properties of $\mathcal{F}$ and $\mathcal{B}$, as well as the relationship between them. Thus it is easy to understand some so-called impossible cases; for example, a white coat is hard to detect when it hangs on a white wall. If the distribution of $\mathcal{F}$ is the same as that of $\mathcal{B}$, it is almost impossible to detect the foreground region without resorting to video semantic analysis methods.
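A minimal sketch of the subtraction step (21), with a simple hypothetical per-pixel threshold standing in for the MRF-based refinement of Section 4.2:

```python
import numpy as np

def subtract_background(frame, background, thresh=25):
    """Background subtraction sketch: form the difference R = V - B
    (Eq. (21)), then threshold |R| per pixel as a first guess at the
    foreground region. The threshold is an illustrative stand-in for
    the MRF refinement."""
    R = frame.astype(float) - background.astype(float)
    return (np.abs(R) > thresh).astype(np.uint8)   # 1 = candidate foreground
```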
4.2 Foreground detection
In this section, we explore the foreground region of an arbitrarily given frame from the original frame set. To simplify the problem, we transform the color frame into a gray one. In most cases, the foreground objects are contiguous pieces, so we can model the region by a Markov Random Field SD1984 , following the idea of some previous works XCW2013 ; S2009 .
First, we set up a matrix $W$ to represent the foreground region $\Omega$:

(23) $\quad W_{ij} = 1$ if pixel $(i,j)\in\Omega$, and $W_{ij} = 0$ otherwise.

It is easy to see that a pixel is inside the foreground region if and only if it is labeled 1 in the matrix $W$; otherwise, the pixel lies outside the foreground region. Then the energy of $W$ can be obtained by the Ising model S2009 :

(24) $\quad E(W) = \alpha \sum_{ij} W_{ij} + \beta \sum_{((i,j),(p,q))\in\mathcal{G}} |W_{ij} - W_{pq}|$

where $\mathcal{G}$ denotes the set of neighboring pixel pairs, and $\alpha$ and $\beta$ are two positive parameters that penalize the size of the foreground region and the discontinuity of the labels, respectively.
Obviously, if we just minimize the energy of the foreground region, it will converge to the empty set, i.e. $W = 0$. In the foreground detecting process, we also tend to allocate the major information of the background subtraction result $\mathcal{R}$ into the foreground part; in other words, the nonzero elements outside the foreground region should be minimized. The foreground detecting problem can thus be arranged as the following model:

(25) $\quad \min_{W}\ E(W) + \gamma \|\mathcal{P}_{\Omega^c}(\mathcal{R})\|_1$

where $\Omega = \{(i,j): W_{ij} = 1\}$ and $\gamma$ is a positive weighting parameter.
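The Ising energy of Equation (24) can be sketched for a 4-connected neighborhood as follows (the parameter values and the exact neighborhood system are illustrative assumptions):

```python
import numpy as np

def ising_energy(W, alpha, beta):
    """Ising-style energy of a binary label matrix W (1 = foreground):
    alpha penalizes the foreground size, beta penalizes label
    disagreements between 4-connected neighbors (Eq. (24), sketched)."""
    size_term = alpha * W.sum()
    # disagreements with right and bottom neighbors cover each pair once
    smooth = np.abs(np.diff(W, axis=0)).sum() + np.abs(np.diff(W, axis=1)).sum()
    return size_term + beta * smooth

W = np.zeros((5, 5), int)
W[1:4, 1:4] = 1          # a solid 3x3 foreground blob
print(ising_energy(W, alpha=1.0, beta=0.5))
```

A compact blob pays only for its size and boundary, while a scattered labeling of the same size pays many more disagreement terms, which is exactly why the MRF prefers contiguous foreground regions.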
5 Experimental analysis
In this section, we evaluate the performance of our Sparse Outliers Iterative Removal algorithm (SOIR). We explore the appropriate number of 'discriminative' frames, test the performance of the algorithm and check its ability to solve large-scale problems. The experiments are conducted on real sequences from public datasets, such as the I2R dataset LWIQ2004 and the flower-wall dataset KJBB1999 . Besides, other real video sequences from public resources are also included. All the experiments are conducted and timed in Matlab R2010a on a PC with an Intel(R) Core(TM) 3.20GHz CPU and 4GB of RAM.
5.1 Number of the discriminative frames
A major contribution of this paper is that, to purify the frame series of a video, we utilize sparse representation to select the most discriminative frames. In this section, we explore the appropriate number of discriminative frames. We conduct our measurements on the I2R dataset, first giving the details for the 'Bootstrap' sequence, a video whose scene is in front of the buffet in a restaurant. We also use some other video sequences in the dataset and give the corresponding results.
We use the first 300 frames of the video sequence as our original frame set, and measure the performance of the SOIR algorithm as the number of selected frames varies from one to thirty. Besides, we need a standard background for comparison: we use the first forty frames as the selected set to extract the standard background, because it is already quite close to the globally optimal solution. The result is shown in Fig.5. We can see that most of the extracted backgrounds are quite similar to the standard one, even when the number of selected frames is small. However, it is a little disappointing that the counter is not recovered exactly even in our standard background. This is limited by the properties of the video sequence: the two small fuzzy areas are in fact the spaces just in front of the buffet, where people stand and take their meals in almost all of the 300 frames. As a result, we have no evidence to prove that these two areas do not belong to the background.
Now, we measure the relationship between the rate of convergence and the number of discriminative frames. We compare each result with the standard one and calculate the distance between them. Meanwhile, we use the distance between the standard background and the origin of the coordinate system as the standard distance, and normalize all the distances by it. The result is shown in Fig.6.
We can see from Fig.6 that the performance is not good enough when the number of frames is quite small, and it improves as the number increases. When the number of discriminative frames reaches about twenty, the distance ratio starts to fluctuate. The fluctuation is natural: in our process, all the weights in Equation (9) are nonzero. When a new frame is added into the selected set as a discriminative frame, and the corresponding pixel is not in the foreground region, it brings a negative influence on the result; although its weight becomes smaller and smaller in later iterations, it always stays positive. In contrast, if the pixel reflects background information, the performance improves. We could remove the fluctuation by tightening the convergence accuracy, but that would bring an enormous growth of computational complexity. Next, we repeat these operations on some other sequences and show the results in Fig.7.
From Fig.7 we can see that when the number of discriminative frames is small, the performance is not good; as the number of frames increases, the distance ratio decreases. We can also find that when the frame number exceeds twenty, the ratio tends to be stable: the solution of our model fluctuates around the optimal solution within the allowed error. What is more, the distance ratio of the 'Fountain' sequence is almost stable even when the frame number is quite small; this benefits from the property of the frame series, most of which are already backgrounds, albeit polluted by some noise. The result of the 'WavingTrees' sequence is not so good: in this sequence, the background keeps changing its shape all the time, and once a new frame is added into the selected set, we have to adjust the result based on the new shape of the trees.
In this section, we studied the relationship between the performance and the size of the selected set (i.e. the number of discriminative frames). We can exactly recover the background from about twenty frames; even in some bad conditions, twenty-five frames are enough. This agrees with our former claim that the result does not depend on the whole video: it is the few 'discriminative' frames that do the work. Most frames in the video series bring us nothing but repeated information. Besides, the content of the video also affects the performance; in some special cases, a frame in the video may already be the exact background.
5.2 Experiments on large scale sequences
In practical applications, the resolution of video keeps increasing and high-definition cameras are now widely used, so the problem we face is becoming more and more difficult. Most traditional methods extract the background from sequences whose resolution is around 150×150 and whose length is usually around 50 frames. When the scale of the data increases, these methods tend to be slow. Although some pixel-wise methods can remain fast by reducing the size of the training set, their precision suffers considerably.
In this section, we check the ability of our model to solve large-scale problems. First, we pay attention to the length of the sequence. We run our model on data consisting of more than 1200 frames, extracting the background over four periods, i.e., from the first 300, 600, 900 and 1200 frames, respectively. The time and the results are shown in Fig. 8.
When dealing with long sequences, our model extracts the background efficiently: hundreds of frames cost only dozens of seconds, and as the number of frames increases, the precision of the extracted background improves. Temporary stay is a problem for most traditional background modeling methods: once a person stays in one place for a while, he may be taken as part of the background in a short video. The experiment shows that, as the number of frames increases, the temporary stay is handled perfectly. The 'Hall' sequence illustrates this: the person is regarded as background in the first period (300 frames) and is removed from the background after he leaves. We can also see that the time consumption is not linear in the number of frames; in fact, the properties of the video affect the speed. On one hand, the uncertainty of the background caused by temporary stay may cost some time. In the experiment on 'ForegroundAperture', the person lying on the table leaves after around the 500th frame. Processing the first 300 frames is quick because these frames are almost identical, while the other three periods spend much time deciding whether this person belongs to the background. The same holds for the last period (1200 frames) of the 'Lobby' sequence. On the other hand, the foreground and the noise also influence the time consumption: the waving trees in the 'Campus' sequence and the person in the 'Moved Object' sequence take some time to remove in the process of extracting the background.
Second, we try our model on video sequences whose resolution is much higher than in the usual datasets. We use four video sequences: one is the 'ShoppingMall' sequence from the I2R dataset, and the other three are intersection-monitoring videos from public resources. We test our model on the first 50 frames and the first 150 frames of each sequence. The results are shown in Fig. 9.
When the resolution increases from 320×256 to 720×576, the time consumption also increases sharply; in fact, each frame of the last video contains more than 5 times as many pixels as the first one. Our model spends about 100 seconds on the high-resolution video of 150 frames, while for a medium-resolution video it only needs dozens of seconds. Besides, Fig. 9 also shows that the content of the video influences the performance. In the first video, the temporarily staying object is identified as background at first, but is weeded out as the number of frames increases. In the third video, the distant cars move slowly in the fixed lens due to perspective, which is effectively an approximate temporary stay. We find that 150 frames are still not enough to remove this phenomenon, and more frames are needed to solve such practical problems.
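The pixel-count claim above is pure arithmetic on the two resolutions quoted in the text and is easy to verify:

```python
# Pixels per frame at the two resolutions discussed above.
low = 320 * 256    # medium-resolution video
high = 720 * 576   # high-resolution video
ratio = high / low
print(low, high, ratio)   # 81920 414720 5.0625
```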
resolution: number |    PCP | DECOLOR |  SOIR
                   | 109.72 |  217.37 |  9.08
                   | 231.47 |  569.89 |  9.98
                   | 370.57 | 1173.13 | 11.55
                   | 465.70 | 1389.76 | 17.63
For comparison, we also examine the time consumption of two PCA models, i.e., the Principal Component Pursuit (PCP) EXYJ2011 and Detecting Contiguous Outliers in the Low-Rank Representation (DECOLOR) XCW2013, both of which perform well on small-scale problems. The results at different data scales (different resolutions and different frame numbers) are shown in Table 1.
As shown in Table 1, PCP is faster than DECOLOR, because DECOLOR pays more attention to precision, but both are much slower than SOIR. The advantage of SOIR becomes even more pronounced as the size of the data grows. When the frame number is 150, SOIR is about 10 times faster than PCP, and about 30 times faster when the frame number increases to 450. Likewise, when the resolution increases from 160×120 to 320×256, SOIR is also about 30 times faster than PCP. For large-scale data, the major time cost of SOIR lies in exploring the 'discriminative' frames; once we have these frames, we can model the background from them quickly and accurately.
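The speed-ups quoted above can be checked directly from the timings in Table 1. The snippet below assumes the rows correspond to the four data scales in the order listed in the table:

```python
# Speed-up of SOIR over PCP and DECOLOR, using the timings (in seconds)
# reported in Table 1, row by row.
pcp     = [109.72, 231.47, 370.57, 465.70]
decolor = [217.37, 569.89, 1173.13, 1389.76]
soir    = [9.08, 9.98, 11.55, 17.63]

for p, d, s in zip(pcp, decolor, soir):
    print(f"SOIR vs PCP: {p / s:5.1f}x   SOIR vs DECOLOR: {d / s:5.1f}x")
```

The first row gives roughly a 12x speed-up over PCP and the third roughly 32x, consistent with the "about 10 times" and "about 30 times" figures in the text.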
The experiments in this section demonstrate our model's ability to solve large-scale problems quickly. This is because we solve the problem from the explored 'discriminative' frames: although the exploration itself takes some time, it removes the repeated and useless frames and thus saves far more time in modeling the background.
5.3 Detecting the foreground
In this section, we detect the foreground region based on the calculated background, and compare our performance with several other methods, i.e., the MOG CW1999, the PCP EXYJ2011 and the DECOLOR XCW2013. DECOLOR models the background by a low-rank matrix and the foreground by an MRF.
To quantitatively evaluate the performance of the different algorithms, we compute the precision and the recall as follows:
\[
\mathrm{precision}=\frac{TP}{TP+FP},\qquad \mathrm{recall}=\frac{TP}{TP+FN} \tag{27}
\]
where TP, FP, TN and FN denote the numbers of true positive, false positive, true negative and false negative detections, respectively. The performance is good when the recall is high without sacrificing the precision. We use the F-measure to check this:
\[
F\text{-measure}=\frac{2\times\mathrm{precision}\times\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}} \tag{28}
\]
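These measures can be computed directly from binary foreground masks. The sketch below is a straightforward NumPy implementation of Eqs. (27) and (28); the function name and the mask convention (1 = foreground) are our own choices, not from the paper.

```python
import numpy as np

def precision_recall_f(detected, truth):
    """Compute precision, recall and F-measure from two binary
    foreground masks (1 = foreground, 0 = background)."""
    detected = np.asarray(detected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(detected & truth)      # true positives
    fp = np.sum(detected & ~truth)     # false positives
    fn = np.sum(~detected & truth)     # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```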
In the experiment, we use video sequences from the I2R dataset and the Wallflower dataset. In these sequences, hand-segmented foreground regions are given for some frames. Once we detect the foreground based on the result of the background extraction, we compare the detected foreground region with the given ground truth and calculate the corresponding F-measure. The test frame is chosen randomly from all the hand-segmented frames. To avoid the influence of temporary stay, enough frames are needed to extract a reliable background; here we use 250 frames, of which the last is the test frame, to form the data set for the background extraction process. Naturally, the extracted background holds for all 250 frames, so we can detect the foreground region for each of them.
The sequences and the results are shown in Fig. 10. In the experiments, SOIR exactly extracts the background for almost all the sequences; the last video (g) is the exception, because the person stays there all the time, so a set of 250 frames is not enough to handle the temporary stay in this video, though it is enough for the other six. We can see that SOIR also performs well in the task of foreground detection, which benefits from the accurate background as well as the MRF model. DECOLOR performs well because it also models the foreground by an MRF; still, on most sequences the results of SOIR are better than those of DECOLOR, because our extracted backgrounds are more exact. The other two non-MRF methods detect the foreground region only roughly: in fact, they can find the border of the foreground region exactly, but they are disturbed by complex noise. We then present the corresponding F-measures of Fig. 10.
Sequence |  SOIR  |  PCP   |  MOG   | DECOLOR
(a)      | 0.9737 | 0.6110 | 0.2047 | 0.5669
(b)      | 0.9020 | 0.7129 | 0.3841 | 0.8244
(c)      | 0.8452 | 0.6986 | 0.5406 | 0.7225
(d)      | 0.8314 | 0.5248 | 0.2498 | 0.6439
(e)      | 0.8170 | 0.6046 | 0.4014 | 0.8966
(f)      | 0.7972 | 0.5902 | 0.2455 | 0.6487
(g)      | 0.6382 | 0.5104 | 0.1962 | 0.3941
Table 2 gives the F-measures of all the detected foreground regions in Fig. 10. The results of SOIR are better than those of the other three methods on six sequences, i.e., (a), (b), (c), (d), (f) and (g), and slightly worse than DECOLOR on sequence (e). We can also find that the performance of SOIR varies among videos. On one hand, this is due to the result of the background extraction, as in sequence (g). On the other hand, the instability of the video's background also affects SOIR: the waving trees in sequence (e) and the moving escalator in sequence (f) are both obstacles in the foreground detection process. In summary, the performance of SOIR is competitive, and its advantage is obvious even when the properties of the video are unfavorable.
6 Conclusion and future work
In this paper, we propose the Sparse Outliers Iterative Removal (SOIR) algorithm to model the background of a video. We find that a few 'discriminative' frames are enough to model the background, and we propose a sparse representation process to refine the original data set. Although exploring the 'discriminative' frames costs some time, it saves much more time in modeling the background. Besides, we propose a cyclic iteration process to extract the background, which combines a tensor-wise PCA model with a pixel-wise outlier removal strategy. SOIR achieves high accuracy and high speed simultaneously on real video sequences, and shows a great advantage on large-scale problems.
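To make the cyclic iteration idea concrete, here is a heavily simplified sketch that alternates a low-rank fit with a pixel-wise outlier test. It uses a plain matrix SVD in place of the paper's tensor-wise PCA, a temporal median for initialization, and a fixed intensity threshold; all of these are illustrative assumptions, not the actual SOIR implementation.

```python
import numpy as np

def extract_background(frames, rank=1, n_iter=10, thresh=50.0):
    """Simplified sketch of the alternating idea behind SOIR:
    (1) fit a low-rank model to the (frames x pixels) matrix, and
    (2) mark pixels that deviate strongly from it as outliers
    (foreground), excluding them from the next fit."""
    X = np.asarray(frames, dtype=float)
    # Robust per-pixel initialization of the background.
    B = np.tile(np.median(X, axis=0), (X.shape[0], 1))
    for _ in range(n_iter):
        W = np.abs(X - B) < thresh          # True = treated as background
        filled = np.where(W, X, B)          # mask out suspected foreground
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        B = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank]
    fg = np.abs(X - B) >= thresh            # final foreground mask
    return B, fg
```

On a synthetic sequence with a constant scene and a bright object passing through one frame, the loop recovers the constant background and flags only the object's pixels as foreground.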
In future work, we will deal with more complex problems in which the background is no longer stable across frames, and sparse representation will be further combined with the background modeling problem.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (No. 51275348, No. 61379014, No. 6122210) and the Program for New Century Excellent Talents in University (Grant No. NCET120399).
References
(1) C. Stauffer, W. Grimson, Adaptive background mixture models for real-time tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1999.
(2) H. Li, C. Shen, Q. Shi, Real-time visual tracking using compressive sensing, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, 1305-1312.
(3) K. Zhang, L. Zhang, M. Yang, Real-time compressive tracking, in: European Conference on Computer Vision (ECCV), 2012, 864-877.
(4) T. Bouwmans, Recent advanced statistical background modeling for foreground detection - a systematic survey, Recent Patents on Computer Science 6(3) (2011) 147-176.
(5) A. Yilmaz, O. Javed, M. Shah, Object tracking: A survey, ACM Computing Surveys (CSUR) 38(4) (2006) 1-45.
(6) S. Li, H. Lu, L. Zhang, Arbitrary body segmentation in static images, Pattern Recognition 45(9) (2012) 3402-3413.
(7) D. Park, H. Byun, A unified approach to background adaptation and initialization in public scenes, Pattern Recognition 46(7) (2013) 1985-1997.
(8) M. Droogenbroeck, O. Paquot, Background subtraction: Experiments and improvements for ViBe, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, 32-37.
(9) V. Mahadevan, N. Vasconcelos, Background subtraction in highly dynamic scenes, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, 1-6.
(10) Y. Chen, C. Chen, C. Huang, Y. Hung, Efficient hierarchical method for background subtraction, Pattern Recognition 40(10) (2007) 2706-2715.
(11) S. Cheung, C. Kamath, Robust techniques for background subtraction in urban traffic video, in: Proceedings of Video Communications and Image Processing (VCIP), 2004, 881-892.
(12) S. Wang, T. Su, S. Lai, Detecting moving objects from dynamic background with shadow removal, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, 925-928.
(13) A. Ulges, T. Breuel, A local discriminative model for background subtraction, in: Pattern Recognition, Springer Berlin Heidelberg, 2008, 507-516.
(14) C. Wren, A. Azarbayejani, T. Darrell, A. Pentland, Pfinder: Real-time tracking of human body, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7) (1997) 780-785.
(15) P. KaewTraKulPong, R. Bowden, An improved adaptive background mixture model for real-time tracking with shadow detection, in: Video-Based Surveillance Systems, 2002, 135-144.
(16) Z. Zivkovic, Improved adaptive Gaussian mixture model for background subtraction, in: Proceedings of the 17th International Conference on Pattern Recognition, 2004, 28-31.
(17) D. Lee, Effective Gaussian mixture learning for video background subtraction, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(5) (2005) 827-832.
(18) K. Kim, T. Chalidabhongse, D. Harwood, L. Davis, Background modeling and subtraction by codebook construction, in: IEEE International Conference on Image Processing (ICIP), 2004, 3061-3064.
(19) K. Kim, T. Chalidabhongse, D. Harwood, L. Davis, Real-time foreground-background segmentation using codebook model, Real-Time Imaging 11(3) (2005) 172-185.
(20) J. Guo, Y. Liu, C. Hsia, C. Hsu, Hierarchical method for foreground detection using codebook model, IEEE Transactions on Circuits and Systems for Video Technology 21(6) (2011) 804-815.
(21) A. Zaharescu, M. Jamieson, Multi-scale multi-feature codebook-based background subtraction, in: IEEE International Conference on Computer Vision Workshops, 2011, 1753-1760.
(22) A. Hamad, N. Tsumura, Background subtraction based on time-series clustering and statistical modeling, Optical Review 19(2) (2012) 110-120.
(23) A. Elgammal, D. Harwood, L. Davis, Non-parametric model for background subtraction, in: Computer Vision - ECCV 2000, Springer Berlin Heidelberg, 2000, 751-767.
(24) E. Learned-Miller, M. Narayana, A. Hanson, Background modeling using adaptive pixelwise kernel variances in a hybrid feature space, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, 2104-2111.
(25) O. Barnich, M. Van Droogenbroeck, ViBe: A universal background subtraction algorithm for video sequences, IEEE Transactions on Image Processing 20(6) (2011) 1709-1724.
(26) L. Li, W. Huang, I. Gu, Q. Tian, Statistical modeling of complex backgrounds for foreground object detection, IEEE Transactions on Image Processing 13(11) (2004) 1459-1472.
(27) M. Heikkila, M. Pietikainen, A texture-based method for modeling the background and detecting moving objects, IEEE Transactions on Pattern Analysis and Machine Intelligence 28(4) (2006) 657-662.
(28) L. Zhang, W. Dong, X. Wu, G. Shi, Spatial-temporal color video reconstruction from noisy CFA sequence, IEEE Transactions on Circuits and Systems for Video Technology 20(6) (2010) 838-847.
(29) S. Liao, G. Zhao, V. Kellokumpu, M. Pietikainen, S. Li, Modeling pixel process with scale invariant local patterns for background subtraction in complex scenes, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, 1301-1306.
(30) N. Oliver, B. Rosario, A. Pentland, A Bayesian computer vision system for modeling human interactions, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8) (2000) 831-843.
(31) Z. Zhou, X. Li, J. Wright, E. Candes, Y. Ma, Stable principal component pursuit, in: IEEE International Symposium on Information Theory (ISIT), 2010, 1518-1522.
(32) E. Candes, X. Li, Y. Ma, J. Wright, Robust principal component analysis?, Journal of the ACM (JACM) 58(3) (2011) No. 11.
(33) G. Tang, A. Nehorai, Robust principal component analysis based on low-rank and block-sparse matrix decomposition, in: IEEE Annual Conference on Information Sciences and Systems (CISS), 2011, 1-5.
(34) C. Guyon, T. Bouwmans, E. Zahzah, Foreground detection based on low-rank and block-sparse matrix decomposition, in: IEEE International Conference on Image Processing (ICIP), 2012, 1225-1228.
(35) X. Zhou, C. Yang, W. Yu, Moving object detection by detecting contiguous outliers in the low-rank representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(3) (2013) 597-610.
(36) J. Wright, A. Yang, A. Ganesh, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2) (2009) 210-227.
(37) W. Zuo, D. Meng, L. Zhang, X. Feng, D. Zhang, A generalized iterated shrinkage algorithm for non-convex sparse coding, in: IEEE International Conference on Computer Vision (ICCV), 2013.
(38) Z. Feng, M. Yang, L. Zhang, Y. Liu, D. Zhang, Joint discriminative dimensionality reduction and dictionary learning for face recognition, Pattern Recognition 46(8) (2013) 2134-2143.
(39) Y. Cong, J. Yuan, J. Liu, Sparse reconstruction cost for abnormal event detection, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, 3449-3456.
(40) T. Guha, R. Ward, Learning sparse representations for human action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 34(8) (2012) 1576-1588.
(41) C. Zhao, X. Wang, W. Cham, Background subtraction via robust dictionary learning, EURASIP Journal on Image and Video Processing, 2011.
(42) E. Elhamifar, G. Sapiro, R. Vidal, See all by looking at a few: Sparse modeling for finding representative objects, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, 1600-1607.
(43) B. Bader, T. Kolda, Algorithm 862: MATLAB tensor classes for fast algorithm prototyping, ACM Transactions on Mathematical Software (TOMS) 32(4) (2006) 635-653.
(44) T. Kolda, B. Bader, Tensor decompositions and applications, SIAM Review 51(3) (2009) 455-500.
(45) S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6) (1984) 721-741.
(46) S. Li, Markov Random Field Modeling in Image Analysis, Springer Publishing Company, 2009.
(47) Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via graph cuts, IEEE Transactions on Pattern Analysis and Machine Intelligence 23(11) (2001) 1222-1239.
(48) K. Toyama, J. Krumm, B. Brumitt, B. Meyers, Wallflower: Principles and practice of background maintenance, in: IEEE International Conference on Computer Vision (ICCV), 1999, 255-261.