Video processing is one of the main branches of image processing and computer vision, and aims to extract knowledge from videos collected from real scenes. As an essential and fundamental research topic in video processing, background subtraction has been attracting increasing attention in recent years. Its main aim is to separate the moving foreground objects from the background in a video, which makes subsequent video processing tasks easier and more efficient. Typical applications of background subtraction include object tracking, urban traffic detection, long-term scene monitoring, video compression and so on.
The initial strategies proposed for background subtraction directly distinguish background pixels from foreground ones through simple statistical measures, like the median (mean) model [5, 6] and some histogram models. Later, more elaborate statistical models, like the MOG and MOGG models, were presented to better describe the distributions of the image pixels located in the background. These methods, however, ignore very useful video structure knowledge, like the temporal similarity of the background scene and the spatial contiguity of foreground objects, and thus often cannot guarantee good performance, especially under complex scenarios. In recent years, low-rank subspace learning models [10, 11] have represented a new trend and achieved state-of-the-art performance on this task due to their better consideration of video structure knowledge in both foreground and background. In particular, these methods assume a rational low-rank structure for video backgrounds, which encodes the similarity of video backgrounds along time, and mostly consider useful prior foreground structures, like sparsity and spatial continuity. Typical models along this line include PCP, GODEC and DECOLOR, which are reviewed in Section 2.2.
Albeit substantiated to be effective on some video sequences of fixed length, there is still a gap in applying such offline methodologies to real video processing applications. Specifically, the amount of video produced by surveillance cameras scattered all over the world is increasing dramatically. This not only makes it critical to perform background subtraction on such large amounts of video, but also urgently requires real-time techniques to handle the constantly emerging videos. Online subspace learning has thus become an important means of alleviating this efficiency issue. Very recently, multiple online methods for background subtraction have been designed [17, 18, 13], which speed up the computation by gradually updating the low-rank structure underlying the video background, incrementally treating only one frame at a time. Such online amelioration significantly speeds up the calculation, and makes it possible to handle the task efficiently, even in real time, in large-scale video contexts.
However, the current online background subtraction techniques still have evident defects when applied to real videos. On one hand, most current methods assume a low-rank structure for the video background while neglecting frequently occurring dynamic camera jitters, such as translation, rotation, scaling and light/shade change, across video sequences. Such issues, however, often happen in real life due to camera status switching or circumstances changing over time, and tend to damage the conventional low-rank assumption for video backgrounds. Indeed, an image sequence formed by slightly translating/rotating/scaling each of its images generally no longer has a low-rank property. The performance of current methods thus tends to degrade evidently in such background-changing cases, and it is critical to make online learning capable of adapting to such camera jitters.
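The loss of low-rank structure under jitter can be illustrated with a small numerical sketch. It is only a toy, under stated assumptions: a random 1-D vector stands in for a vectorized background image, and a circular shift (`np.roll`) stands in for camera translation. The names `static_video` and `jitter_video` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(64)  # toy stand-in for a vectorized background image

# Static camera: every column is the same background, so the matrix has rank 1.
static_video = np.stack([frame] * 8, axis=1)

# Jittering camera: each column is the same background, circularly shifted
# by a different number of pixels (a crude stand-in for translation).
jitter_video = np.stack([np.roll(frame, s) for s in range(8)], axis=1)

rank_static = np.linalg.matrix_rank(static_video)
rank_jitter = np.linalg.matrix_rank(jitter_video)
print("rank without jitter:", rank_static)
print("rank with jitter:", rank_jitter)
```

Even shifts of a single pixel per frame are enough to raise the rank of this toy matrix well above one, which is why an alignment step is needed before a low-rank model can be applied.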
On the other hand, all current online methods for this task use a fixed loss term, e.g., the $L_2$ or $L_1$ loss, in their models, which implicitly assumes that the noise (foreground) in videos follows a fixed probability distribution, e.g., Gaussian or Laplacian. Such an assumption, however, deviates from real scenarios, where the foreground often varies dramatically over time. E.g., in some frames no foreground object exists, and the noise can be properly modeled as Gaussian (i.e., $L_2$-norm loss); in other frames an object might occlude a large area of the background, and the noise is better modeled as a long-tailed Laplacian (i.e., $L_1$-norm loss); while in more common cases the foreground contains multiple modalities of noise, as depicted in Fig. 1, which requires more complex noise models. Ignoring this diversity of video foreground makes current methods not robust enough to finely adapt to real-time foreground/noise variations in practice.
To alleviate the aforementioned issues, in this work we propose a new online background subtraction method. The contribution can be summarized as follows:
Firstly, instead of using a fixed noise distribution throughout all video frames as is conventional, the proposed method models the noise/foreground of each video frame as a separate mixture of Gaussians (MoG) distribution, regularized by a penalty that keeps its parameters close to those calculated from the previous frames. Such a penalty can be equivalently reformulated as a conjugate prior, encoding the previously learned noise knowledge, for the MoG noise of the current frame. Due to the good approximation capability of MoG to a wide range of distributions, our method can finely adapt to video foreground variations even when the video noise has a dynamic, complex structure.
Secondly, we introduce an affine transformation operator for each video frame into the proposed model, which can be automatically fitted from the temporal video context. This makes our method capable of adapting to a wide range of video background transformations, like translation, rotation, scaling and any combination of them, by properly aligning video backgrounds so that they reside on a low-rank subspace in an online manner. The proposed method can thus perform evidently more robustly on videos with dynamic camera jitters than previous methods.
Thirdly, the efficiency of our model is further enhanced by embedding the sub-sampling technique into calculation. By utilizing this strategy, the proposed method can be accelerated to execute more than frames per second on average (in Matlab platform), while still keeping a good performance in accuracy, which meets the real-time requirement for practical video processing tasks. Besides, attributed to the MoG noise modeling methodology, the separated foreground layers always can be interpreted with certain physical meanings, as shown in Fig. 1, which facilitates us to get more intrinsic knowledge under video foreground.
Fourthly, our method can be easily extended to other subspace alignment tasks, like image alignment and video stabilization applications. This implies the good generalization of the proposed method.
The paper is organized as follows: Section 2 reviews related work. Section 3 proposes our model and related algorithms; its sub-sampling amelioration and other extensions are also introduced in this section. Section 4 shows experimental results on synthetic and real videos to substantiate the superiority of the proposed method. Discussions and concluding remarks are finally given.
2 Related Work
2.1 Low Rank Matrix Factorization
Low-rank matrix factorization (LRMF) is one of the most commonly utilized subspace learning approaches for background subtraction. The main idea is to extract the low-rank approximation of the data matrix as the product of two smaller matrices, corresponding to the basis matrix and coefficient matrix, respectively. Based on the loss terms utilized to measure the approximation extent, the LRMF methods can be mainly categorized into three classes. $L_2$-LRMF methods utilize the $L_2$-norm loss in the model, implicitly assuming that the noise distribution in data is Gaussian. Typical $L_2$-LRMF methods include weighted SVD, WLRA and $L_2$-Wiberg. $L_1$-LRMF methods are the most typical robust ones. $L_1$-LRMF utilizes the $L_1$ loss term, implying that the data noise follows a Laplacian distribution. Due to the heavy-tailed characteristic of the Laplacian, such methods generally perform more robustly in the presence of heavy noises/outliers. Some commonly adopted $L_1$-LRMF methods include L1Wiberg, RegL1MF, PRMF and so on. To adapt to more complex noise configurations in data, several models have recently been proposed to encode the noise as a parametric probabilistic model, and accordingly learn the loss term as well as the model parameters simultaneously. In this way, the model is capable of adapting to a wider range of noises than the previous ones with fixed noise distributions. The typical methods in this category include the MoG-LRMF [16, 25] and MoEP-LRMF methods, which represent noise distributions as a MoG and a mixture of exponential power distributions, respectively. Despite having been verified to be effective in certain scenarios, these methods implicitly assume stable backgrounds across all video frames and a fixed noise distribution for foreground objects throughout videos. As we have analyzed, neither assumption is proper for practically collected videos, which tends to degrade their performance.
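For the simplest case in this taxonomy, an $L_2$ loss with no missing entries, the best rank-$r$ factorization is given by the truncated SVD (the Eckart-Young theorem). The sketch below illustrates that baseline only; it is not any of the weighted or robust methods cited above, and the function name `l2_lrmf` is hypothetical.

```python
import numpy as np

def l2_lrmf(X, r):
    """Rank-r factorization X ~= U @ V.T minimizing the Frobenius (L2) error.

    Assumes a fully observed data matrix X of shape (d, n).
    """
    P, s, Qt = np.linalg.svd(X, full_matrices=False)
    U = P[:, :r] * s[:r]   # basis matrix scaled by singular values, shape (d, r)
    V = Qt[:r, :].T        # coefficient matrix, shape (n, r)
    return U, V

# Synthetic check: a rank-3 matrix corrupted by small Gaussian noise.
rng = np.random.default_rng(0)
d, n, r = 40, 30, 3
X_clean = rng.standard_normal((d, r)) @ rng.standard_normal((r, n))
X = X_clean + 0.01 * rng.standard_normal((d, n))

U, V = l2_lrmf(X, r)
rel_err = np.linalg.norm(X_clean - U @ V.T) / np.linalg.norm(X_clean)
print("relative reconstruction error:", rel_err)
```

With Gaussian noise this baseline recovers the clean matrix well. Large sparse outliers, such as foreground objects, are what motivate the $L_1$ and mixture-based losses discussed above.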
2.2 Background Subtraction
As a fundamental research topic in video processing, background subtraction has been widely investigated. The initial strategies mainly assumed that the distribution (along time) of background pixels can be distinguished from that of foreground ones. Thus, by judging whether a pixel deviates significantly from the background pixel distribution, we can easily decide whether it is located in the background or the foreground. The simplest methods along this line directly utilize a statistical measure, like the median or mean, to encode background knowledge. Later, more complex distributions on background pixels, like MOG, MOGG and so on, were proposed and shown to be more effective. The disadvantage of these methods is that they neglect useful video structure knowledge, e.g., the temporal similarity of the background scene and the spatial contiguity of foreground objects, and thus often cannot guarantee good performance in practice. Low-rank subspace learning models represent the recent state of the art for this task on general surveillance videos due to their better consideration of video structures. These methods implicitly assume a stable background in videos, which naturally has a low-rank structure. Multiple models have been proposed on this topic recently, typically including PCP, GODEC and DECOLOR. Albeit obtaining state-of-the-art performance on some benchmark video sets, these methods still cannot be effectively utilized in real-time problems due to their simplified assumptions on both the video background (stationary background scenes) and the foreground (a fixed type of noise distribution along time). They also tend to encounter efficiency problems under real-time requirements, especially for large-scale videos. Very recently, some deep neural network approaches [29, 30] have also been attempted on the task for specific scenes, but they need a large amount of pre-annotation. In this paper we mainly focus on handling general surveillance videos without any supervised foreground/background knowledge, and thus have not considered this approach in our experimental comparison.
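The simplest statistical strategy mentioned above, the temporal median model, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the threshold value and the synthetic video are arbitrary, and `median_background_subtraction` is a hypothetical helper name.

```python
import numpy as np

def median_background_subtraction(frames, threshold):
    """Per-pixel temporal median as background; large deviations are foreground.

    frames: array of shape (T, H, W). Returns (background, masks), where
    masks[t] is a boolean (H, W) array marking foreground pixels of frame t.
    """
    background = np.median(frames, axis=0)
    masks = np.abs(frames - background) > threshold
    return background, masks

# Synthetic video: constant gray background with a bright 2x2 object
# moving along the diagonal during the first six frames.
T, H, W = 9, 8, 8
frames = np.full((T, H, W), 0.5)
for t in range(6):
    frames[t, t:t + 2, t:t + 2] = 1.0

background, masks = median_background_subtraction(frames, threshold=0.25)
print("foreground pixels per frame:", masks.sum(axis=(1, 2)))
```

Each pixel is treated independently here, which is exactly the limitation noted above: no temporal subspace structure and no spatial contiguity is exploited.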
2.3 Online Subspace Learning
Nowadays, designing online subspace learning methods to handle real-time background subtraction has been attracting increasing attention [31, 32]. The basic idea is to process only one frame at a time, and gradually ameliorate the background based on the real-time video variations. The state-of-the-art methods along this line include GRASTA, OPRMF, GOSUS, PracReProCS and incPCP [34, 35]. GRASTA used an $L_1$-norm loss for each frame to encode sparse foreground objects, and employed the ADMM strategy for subspace updating. Similar to GRASTA, OPRMF also optimized an $L_1$-norm loss term while adding regularization terms on the subspace parameters to alleviate overfitting. GOSUS designed a more complex loss term to encode the structure of video foreground, with an updating algorithm similar to that of GRASTA. Besides, PracReProCS and incPCP were recently proposed as incremental extensions of the classical PCP algorithm.
However, these methods are still deficient due to their insufficient consideration of variations in both the background and foreground of real videos. On one hand, they assume a low-rank structure for the video background, which ignores frequently occurring background changes and camera jitters across video sequences. On the other hand, they all fix the loss term in their models, which implicitly assumes that the noise involved in data is generated from a fixed probability distribution. This, however, underestimates the temporal variations of the foreground objects in videos. That is, in some frames the foreground signals might be very weak while in others they might be very evident. The noise distributions are thus not fixed but vary across video frames. Underestimating both foreground and background knowledge tends to degrade their capability in real online tasks.
2.4 Robust Subspace Alignment
Recently, multiple subspace learning strategies have been constructed to learn transformation operators on video frames, making the methods robust to camera jitters. A typical method is RASL (robust alignment by sparse and low-rank decomposition), which embeds the learning of transformation operators into the classical robust principal component analysis (RPCA) model, and simultaneously optimizes the parameters of these operators as well as the low-rank (background) and sparse (foreground) matrices. Similar extensions were made in [37, 38, 39]. However, such batch-mode methods are generally slow to run and can only deal with moderately sized videos. To address this issue, incPCP_TI extends incPCP by taking translation and rotation into consideration to deal with rigid image transformations. t-GRASTA realized a more general subspace alignment by embedding an affine transformation operator into online subspace learning. Although they speed up the offline methods, these methods utilize a simple $L_1$-norm loss to model the foreground. This simple loss cannot reflect the dramatic foreground variations that commonly exist in real videos, since a single Laplacian cannot finely reflect the complex configurations of video foregrounds. This deficiency tends to degrade their performance on online background subtraction. Comparatively, our proposed method fully encodes both dynamic background and foreground variations in videos, and is thus expected to attain better background subtraction performance, as depicted in Fig. 2.
3 Online MoG-LRMF
We first briefly introduce the MoG-LRMF method, which is closely related to the modeling strategy for foreground variations in our method.
3.1 MoG-LRMF Review
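Let $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ be the given data matrix, where $d$ and $n$ denote the dimensionality and number of data, respectively, and each column $x_j$ is a $d$-dimensional measurement. A general LRMF problem can be formulated as:

$$\min_{U, V} \left\| W \odot \left( X - U V^T \right) \right\|_{L_p}, \tag{1}$$

where $U \in \mathbb{R}^{d \times r}$ and $V \in \mathbb{R}^{n \times r}$ denote the basis and coefficient matrices, with $r \ll \min(d, n)$, implying the low-rank property of $U V^T$. $W$ is the indicator matrix of the same size as $X$, with $w_{ij} = 0$ if $x_{ij}$ is missing and $1$ otherwise. $\|\cdot\|_{L_p}$ denotes the power of an $L_p$ norm, most commonly adopted as the $L_2$ and $L_1$ norms in previous research. Eq. (1) can also be equivalently understood under the maximum likelihood estimation (MLE) framework as:

$$x_{ij} = u_i^T v_j + e_{ij}, \tag{2}$$

where $u_i, v_j \in \mathbb{R}^r$ are the $i$-th and $j$-th row vectors of $U$ and $V$, respectively, and $e_{ij}$ denotes the noise element embedded in $x_{ij}$. Under the assumption that the noise follows a Gaussian/Laplacian distribution, the MLE model exactly complies with Eq. (1) with $L_2$/$L_1$-norm loss terms. This means that $L_2$/$L_1$-norm LRMF implicitly assumes that the noise distribution underlying the data is Gaussian/Laplacian. Such a simple assumption often deviates from real cases, which generally contain more complicated noise configurations [16, 26].

To make the model robust to complex noises, the noise term can be modeled as a parametric probability distribution so that it adapts more flexibly to real cases. A mixture of Gaussians (MoG) is naturally selected for this task due to its strong approximation capability to general distributions. Specifically, by assuming that each $e_{ij}$ follows

$$e_{ij} \sim \sum_{k=1}^{K} \pi_k \, \mathcal{N}\left(e_{ij} \mid 0, \sigma_k^2\right)$$

under the i.i.d. assumption, we can then get the log-likelihood function as follows:

$$\mathcal{L}(U, V, \Pi, \Sigma) = \sum_{i,j} w_{ij} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}\left(x_{ij} \mid u_i^T v_j, \sigma_k^2\right),$$

where $\Pi = \{\pi_k\}_{k=1}^{K}$ and $\Sigma = \{\sigma_k^2\}_{k=1}^{K}$ denote the mixture rates and variances involved in MoG, respectively. The EM algorithm can then be readily utilized to estimate all parameters in the model, including the responsibility parameters $\gamma_{ijk}$ (in the E-step), the MoG parameters $\Pi$, $\Sigma$, and the subspace parameters $U$, $V$, through solving a weighted-$L_2$-LRMF problem:

$$\min_{U, V} \sum_{i,j} w_{ij} \left( \sum_{k=1}^{K} \frac{\gamma_{ijk}}{2 \sigma_k^2} \right) \left( x_{ij} - u_i^T v_j \right)^2.$$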
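One EM iteration for the noise part of such a model can be sketched as follows. This is a minimal illustration, assuming the residuals $e = x - u^T v$ are already available and that the MoG components are zero-mean. The function name `mog_em_step` and the synthetic data are hypothetical, and the subspace update itself is not shown.

```python
import numpy as np

def mog_em_step(residual, pi, sigma2):
    """One EM iteration for a zero-mean MoG fitted to residuals.

    residual: (m,) array; pi, sigma2: (K,) arrays of mixture rates and variances.
    Returns (gamma, pi_new, sigma2_new, weights).
    """
    e2 = residual[:, None] ** 2                                   # (m, 1)
    # E-step: log(pi_k * N(e | 0, sigma_k^2)), normalized over k per entry.
    log_p = np.log(pi) - 0.5 * np.log(2 * np.pi * sigma2) - e2 / (2 * sigma2)
    log_p -= log_p.max(axis=1, keepdims=True)                     # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                     # responsibilities
    # M-step for the mixture parameters.
    Nk = gamma.sum(axis=0)
    pi_new = Nk / Nk.sum()
    sigma2_new = np.maximum((gamma * e2).sum(axis=0) / np.maximum(Nk, 1e-12), 1e-12)
    # Per-entry weights of the resulting weighted-L2 factorization problem.
    weights = (gamma / (2 * sigma2_new)).sum(axis=1)
    return gamma, pi_new, sigma2_new, weights

# Synthetic residuals: 90% small Gaussian noise, 10% large "foreground" noise.
rng = np.random.default_rng(0)
m = 5000
is_outlier = rng.random(m) < 0.1
residual = np.where(is_outlier, rng.normal(0, 1.0, m), rng.normal(0, 0.01, m))

pi, sigma2 = np.array([0.5, 0.5]), np.array([0.01, 1.0])
for _ in range(50):
    gamma, pi, sigma2, weights = mog_em_step(residual, pi, sigma2)
print("mixture rates:", pi)
print("variances:", sigma2)
```

Entries dominated by the large-variance component receive small weights, so the subsequent weighted least-squares fit of the subspace is driven mostly by background pixels.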
3.2 Online MoG-LRMF: Model
3.2.1 Probabilistic Modeling
The main idea of the online MoG-LRMF (OMoGMF) method is to gradually fit a specific MoG noise distribution for the foreground and a specific subspace for the background for each newly arriving frame along the video sequence, under the regularization of the foreground/background knowledge learned from the previous frames. The updated MoG noise parameters include , , which are regularized under the previously learned noise knowledge , and . The updated subspace parameters include the coefficient vector for and the current subspace , required to be regularized under the previously learned subspace .
The model of OMoGMF can be deduced from MAP (maximum a posteriori) estimation by assuming a separate MoG noise distribution on each newly arriving frame in Eq. (2), and then we have:
where means the pixel of and Multi denotes the multinomial distribution. We then formulate prior terms for foreground and background, respectively. For the MoG parameters of foreground, we set the natural conjugate priors to and , which are the Inverse-Gamma and Dirichlet distributions , respectively, as follows:
where , . It can be calculated that the maxima of the above conjugate priors are attained at the previously learned variances and mixture rates, respectively. This implies that the priors implicitly encode the previously learned noise knowledge into the OMoGMF model, and help keep the MoG parameters of the current frame from straying too far from the previously learned ones. For the subspace of background, a Gaussian prior can be set for each of its row vectors:
where is a positive semi-definite matrix. This facilitates the to-be-learned subspace variable U being well regularized by the previously learned one. Details of how to set will be introduced in Sec. 3.4. To make a complete Bayesian model, we also set a noninformative prior for , which does not intrinsically influence the calculation. The full graphical model is depicted in Fig. 3.
3.2.2 Objective Function
All hyperparameters are denoted by, and after marginalizing the latent variable , we can get the posterior distribution of in the following form:
Based on the MAP principle, we can get the following minimization problem for calculating :
In the above problem, the first term is the likelihood term, which enforces the learned parameters adapt to the current frame . The second term is the regularization term for noise distribution, whose function can be more intuitively interpreted by the following equivalent form:
where , denotes the KL divergence between two distributions. It can be evidently observed that functions to rectify the foreground distribution on the current frame (with parameters ) to approximate the previously learned one (with parameters ). Besides, the third term in (9) corresponds to a Mahalanobis distance between each row vector of to that of , thus functioning to rectify the current learned subspace by the previously learned one. The compromising parameter and control the strength of the priors, and their physical meanings and setting manners will be introduced in Sec. 3.4.
To easily compare differences of our model with the previous ones, we list typical models along this research line, as well as ours, in Table I.
Table I
| Method | Objective Function | Constraint/Basic Assumption | Implementation Scheme |
| | | | Online: Heuristically update |
| OPRMF | | | Online: Heuristically update |
| GOSUS | | | Online: Heuristically update |
3.3 Online MoG-LRMF: Algorithm
The online-EM algorithm can be readily utilized for solving the OMoGMF model (9), by alternately implementing the following E-step and M-step on each new frame sample.
Online E Step: As in the traditional EM strategy, this step aims to estimate the expectation of the posterior probability of the latent variable, which is also known as the responsibility. The updating equation is as follows:
Online M Step: On updating MoG parameters , we need to minimize the following sub-optimization problem:
The closed-form solution is as follows (inference details are listed in the supplementary material (SM)):
On updating coefficient parameter v, we need to solve the sub-optimization problem of (9) with respect to v as:
where each element of is for . This problem is a weighted least square problem, and has the closed-form solution as:
On updating the subspace parameter , we need to solve the following sub-problem of (9):
and it has a closed-form solution for each of its row vectors as:
In order to get a simple updating rule, we set
and then we have . By using matrix inverse equation  and the equation , the update rules for and can be reformulated as:
Thus in each step of updating , we only need to save , calculated in the last step, which only needs fixed storage memory. Note that since the matrix inverse computations are avoided in the above updating equations, the efficiency of the algorithm is guaranteed.
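The general mechanism by which an inverse can be carried from frame to frame without being recomputed is the Sherman-Morrison identity. The sketch below is an illustration under an assumption, namely that the accumulated symmetric matrix changes by a weighted rank-one term $w\,vv^T$ per frame; the exact recursion of the method is not reproduced here, and `rank_one_inverse_update` is a hypothetical name.

```python
import numpy as np

def rank_one_inverse_update(A_inv, v, w):
    """Return the inverse of (A + w * v v^T) given A_inv = A^{-1}.

    Uses the Sherman-Morrison identity, so no matrix inversion is performed.
    Assumes A is symmetric, so that v^T A^{-1} equals (A^{-1} v)^T.
    """
    Av = A_inv @ v
    return A_inv - (w * np.outer(Av, Av)) / (1.0 + w * (v @ Av))

# Check against direct inversion on a small symmetric positive definite matrix.
rng = np.random.default_rng(0)
r = 4
B = rng.standard_normal((r, r))
A = B @ B.T + np.eye(r)
v = rng.standard_normal(r)
w = 0.7

A_inv_new = rank_one_inverse_update(np.linalg.inv(A), v, w)
direct = np.linalg.inv(A + w * np.outer(v, v))
print("max abs difference:", np.abs(A_inv_new - direct).max())
```

Only the current $r \times r$ inverse has to be stored between frames, which matches the fixed-memory property noted above.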
Since the subspace, representing the background knowledge, changes relatively slowly along the video sequence, we only fine-tune once after recursively implementing the above E-M steps on updating , , , and until convergence for each new sample under fixed subspace . The subspace can then be fine-tuned to adjust the temporal background change in this video frame. Note that there are only simple computations involved in the above updating process, except that in (16), we need to compute the inverse of a matrix. In the background subtraction contexts, the rank is generally with a small value and far less than . We thus can very efficiently calculate this matrix inverse in general.
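The coefficient update under a fixed subspace is a weighted least-squares problem whose normal equations involve only an $r \times r$ matrix. A minimal sketch follows, assuming per-pixel weights `w` that multiply the squared residuals; the paper's exact weight definition is not reproduced, and `update_coefficients` is a hypothetical name.

```python
import numpy as np

def update_coefficients(U, x, w):
    """Solve min_v sum_i w_i * (x_i - u_i^T v)^2 for the coefficient vector v.

    U: (d, r) subspace, x: (d,) frame, w: (d,) non-negative per-pixel weights.
    """
    UtW = U.T * w                              # (r, d): column i scaled by w_i
    return np.linalg.solve(UtW @ U, UtW @ x)   # only an r x r system is solved

# Toy frame: background lies in the subspace, 20 pixels are covered by foreground.
rng = np.random.default_rng(0)
d, r = 200, 3
U = rng.standard_normal((d, r))
v_true = rng.standard_normal(r)
x = U @ v_true
x[:20] += 5.0            # simulated foreground pixels
w = np.ones(d)
w[:20] = 0.0             # weights that ignore the foreground pixels

v_hat = update_coefficients(U, x, w)
print("recovered coefficients:", v_hat)
```

Since $r$ is typically far smaller than $d$, the cost is dominated by forming the $r \times r$ matrix, which is linear in the number of pixels.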
The OMoGMF algorithm can then be summarized in Algorithm 1. For initialization, we need a warm start, obtained by running PCA on a small batch of starting video frames to get an initial subspace, employing the MoG algorithm on the extracted noise to get initial MoG parameters, and calculating the initial , for subspace learning.
3.4 Several remarks
On relationship between conjugate prior and KL divergence:
Actually we can prove a general result to understand the conjugate prior as an equivalent KL divergence regularization. For the full exponential family distributions, we have the following theorem (all proofs are presented in SM due to page limitation):
Theorem 1 If a distribution belongs to the full exponential family with the form: and its conjugate prior follows: then we have:
where and is a constant independent of .
Since both Gaussian and multinomial distributions belong to the full exponential family, both conjugate priors in (6) can be written in their equivalent KL divergence expressions (10). We prefer to use the latter form in our study since it can more intuitively deliver the noise regularization insight underlying our model in a deterministic manner.
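As a worked instance of this equivalence, consider the mixture rates alone. Assume, purely for illustration, a Dirichlet prior with parameters $\alpha_k = N \hat{\pi}_k + 1$, where $\hat{\pi}_k$ are previously learned rates and $N > 0$ is a pseudo-count; these symbols are hypothetical and need not match the notation of the model. Then:

$$
\begin{aligned}
-\ln \mathrm{Dir}(\Pi \mid \alpha)
&= -\sum_{k} (\alpha_k - 1) \ln \pi_k + \ln B(\alpha) \\
&= -N \sum_{k} \hat{\pi}_k \ln \pi_k + \ln B(\alpha) \\
&= N \, D_{KL}\left(\hat{\Pi} \,\|\, \Pi\right) - N \sum_{k} \hat{\pi}_k \ln \hat{\pi}_k + \ln B(\alpha),
\end{aligned}
$$

where $B(\alpha)$ is the normalizing constant of the Dirichlet distribution and the last step uses $D_{KL}(\hat{\Pi} \,\|\, \Pi) = \sum_k \hat{\pi}_k \ln \hat{\pi}_k - \sum_k \hat{\pi}_k \ln \pi_k$. The last two terms do not depend on $\Pi$, so under this parameterization maximizing the prior is the same as minimizing a KL divergence to the previously learned rates, scaled by $N$.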
On relationship to batch-mode model:
Under the model setting of (9) (especially for the two regularization terms and ), there is an intrinsic relationship between our online model incrementally implemented on current sample with a batch-mode one on all learned samples , as described in the following theorem:
Theorem 2 By setting and , minimizing (12) for and (17) for are equivalent to calculating:
respectively. Moreover, under these settings, it holds that:
The above result demonstrates the batch-mode understanding of our online learning schemes, under fixed previously learned variables (), which have not been stored in memory in the online implementation manner and cannot be re-calculated on previous frames.
On parameters and : Although the natural choices for them are and based on Theorem 2, under these settings, the prior knowledge learned from previous frames will be gradually accumulated (note that the value of will increase to infinity), and the function of the likelihood term (i.e., the effect of the current frame) will be more and more alleviated with time. However, as the motivation of this work, we expect that our method can consistently fit the foreground variations and dynamically adapt the noise changes with time, and thus hope that the likelihood term can constantly play roles in the computation. In our algorithm, we just easily set as a fixed constant ( correspondingly), meaning that we dominate the adjacent frames to rectify the online parameter updating of the current frame. In practical cases, a moderate (e.g., we set it as in all our experiments) is preferred to make the method adaptively reflect temporal variations of video foreground, while not too sensitive to single frame change, as clearly depicted in Fig. 4. Similarly, we easily set as throughout our experiments to let the updated subspace slightly lean to the current frame.
3.5 Efficiency and Accuracy Amelioration
We then introduce two useful techniques to further enhance the efficiency and accuracy of the proposed method.
3.5.1 Sub-sampling
It can be shown that a large low-rank matrix can be reconstructed from a small number of its entries  under certain low-rank assumption. Inspired by some previous attempts  on this issue, we also prefer to use sub-sampling technique to further improve efficiency of our method.
For a newly coming frame , we randomly sample some of its entries to get the sub-sampling data , where is the index set of the sampled entries, and then we only use to update the parameters involved in our model. The updating of MoG parameters and is similar to the original method, and can be solved under the sampled subspace . While for , we only need to update its row entries on through using and .
Generally speaking, a lower sub-sampling rate improves efficiency at the cost of accuracy, and we thus need to find a trade-off between efficiency and accuracy in real cases. E.g., in evidently low-rank background cases the sampling rate can be smaller, while for scenes with complex backgrounds across videos we need to sample more data entries to guarantee accuracy.
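The sub-sampling idea can be sketched for the coefficient vector alone. This is a simplified illustration under stated assumptions: it uses ordinary least squares on the sampled rows instead of the MoG-weighted update, and `subsampled_coefficients` and the 1% rate are hypothetical choices.

```python
import numpy as np

def subsampled_coefficients(U, x, rate, rng):
    """Estimate v from a random subset of the entries of x.

    U: (d, r) subspace, x: (d,) frame, rate: fraction of entries to sample.
    Returns (v, omega), where omega is the index set of sampled entries.
    """
    d, r = U.shape
    m = max(r, int(round(rate * d)))                # at least r entries are needed
    omega = rng.choice(d, size=m, replace=False)    # sampled index set
    v, *_ = np.linalg.lstsq(U[omega], x[omega], rcond=None)
    return v, omega

# Toy frame with 10,000 pixels, of which only 1% are used to fit v.
rng = np.random.default_rng(0)
d, r = 10000, 3
U = rng.standard_normal((d, r))
v_true = rng.standard_normal(r)
x = U @ v_true + 0.01 * rng.standard_normal(d)

v_hat, omega = subsampled_coefficients(U, x, rate=0.01, rng=rng)
background = U @ v_hat                              # full-resolution background estimate
print("entries used:", len(omega))
print("coefficient error:", np.linalg.norm(v_hat - v_true))
```

The full-resolution background is still recovered from the complete subspace, while the per-frame fitting cost scales with the number of sampled entries rather than with $d$.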
3.5.2 TV-norm regularization
The foreground is defined as any object that occludes the background during a period of time. In real-world scenes, a foreground object often appears as a contiguous shape, and its region is generally spatially smooth. In our online method, we also exploit such spatial smoothness to further improve the accuracy of foreground object detection.
There are several strategies to encode the smoothness property of an image, e.g., Markov random field (MRF) [45, 3, 46], Total Variation (TV) minimization [4, 47], and structure sparsity [18, 48]. Considering effectiveness and efficiency, we employ the TV-minimization approach in our method. For a foreground frame obtained by our method, we calculate the following TV-minimization problem:
where is the TV norm and denotes the foreground got by our method. The optimization problem (22) can be readily solved by TV-threshold algorithm [49, 50]. In our experiments, we just empirically set as about ( is the largest variance among MoG components), and our method can perform well throughout all our experiments.
3.6 Transformed Online MoG-LRMF
Due to camera jitter or circumstance changes, videos collected from real scenes always contain background changes over time, which tends to hamper the low-rank assumption on subspace learning models. A real-time alignment is thus required to guarantee the soundness of the online subspace learning methods on the background subtraction task. To this aim, we embed a transformation operator into our model and optimize its parameters as well as other subspace and noise parameters to facilitate our model adaptable to such misaligned videos. Specifically, for a newly coming frame , we aim to learn a transformation operator under the current subspace . Denote the MoG noise as
where , is the th entry of the vector and denotes the transformation with parameters on . The transformation can be an affine or projective transformation.
Similar as (9), we can get the MAP problem:
The key to solve this problem is to deduce the updating equation to . Since is a nonlinear geometric transform, it’s hard to directly optimize . So we consider to optimize the following reformulated problem:
where is the Jacobian of with respect to . After we get , we add it into to update transformation. This method iteratively approximates the original nonlinear transformation with a locally linear approximation [36, 41]. Like OMoGMF, we also use online-EM algorithm to solve the problem. The updating rule of MoG parameters can use Eq. (13) and (14) by changing into . And for and , we need to solve the following problem:
Table II
| Method | Objective Function | Constraint/Basic Assumption | Implementation Scheme |
| Ebadi et al. | | | Offline |
| t-GRASTA | | | Online: Heuristically update |
By reformulating this weighted least square problem as:
we can directly get its closed-form solution as:
where . Through iteratively updating all involved parameters as aforementioned, the objective function (25) is monotonically increasing. After getting the final outputs of these parameters, we can use a strategy similar to that introduced in Sec. 3.3.2 to update the subspace .
The above transformed-OMoGMF (t-OMoGMF) algorithm is summarized in Algorithm 2. To better illustrate the function of the transformation operator, we show in Fig. 5 the foregrounds separated from the low-rank subspace by Algorithm 2 during its iterations on one frame of a gradually transformed video sequence. It is seen that the frame can be automatically adjusted to be well aligned by the gradually rectified transformation operator. We list some typical transformed methods for background subtraction in Table II for easy comparison of their different properties.