Pixels values of infrared frames correspond to the relative differences in the amount of thermal energy emitted or reflected from objects in the scene. Due to this fact, infrared cameras are equally applicable for both day and night scenarios, while at the same time, compared to visual-optical cameras, are less affected by changing illumination or background texture. Furthermore, infrared imagery eliminates any privacy issues as people being depicted in the scene can not be identified . These features make infrared cameras prime candidate for persistent video surveillance systems.
Although, infrared imagery can alleviate several problems associated with visual-optical videos, it has its own unique challenges such as a) low signal-to-noise ratio (noisy data) and b) almost continuous pixel values that model objects’ temperature. Both issues complicate pixel responses modeling. An example of raw thermal responses is presented in Figure 1
, where pixel values are floating point numbers ranging from 293 to 299 Kelvin degrees. Due to this peculiarity most of conventional computer vision techniques, that successfully used for visual-optical data, can not be applied straightforward on thermal imagery.
, the task of background subtraction constitutes a key component, as this is one of the most common methods for locating moving objects, facilitating search space reduction and visual attention modeling.
and can be classified into three main categories: basic background modeling [19, 28], statistical background modeling [11, 27] and background estimation [20, 23]
. The most used methods are the statistical ones due to their robustness to critical situations. In order to statistically represent the background, a probability distribution is used to model the history of pixel values intensity over time. Towards this direction, the work of Stauffer and Grimson, is one of the best known approaches. It uses a Gaussian mixture model, with fixed number of components, for a per-pixel density estimate. Similar to this approach, Makantasis et al. in 
propose a Student-t mixture model for background modeling, taking advantage of Student-t distribution compactness and robustness to noise and outliers. The works of and  extend the method of  by introducing a rule based on a user defined threshold to estimate the number of components. However, this rule is application dependent and not directly derived from the data. Haines and Xiang in  address this drawback by using a Dirichlet process mixture model. Due to the computational cost of their method, the authors propose a GPU implementation. All of the aforementioned techniques present the drawback that objects’ color properties are highly affected by scene illumination, making the same object to look completely different under different lighting or weather conditions.
Although, thermal imagery can provide a challenging alternative for addressing the aforementioned difficulty, there exist few works for thermal data. The authors of [7, 8, 9] exploit contour saliency to extract foreground objects. Initially, they utilize a unimodal background modeling technique to detect regions of interest and then exploit the halo effect of thermal data for extracting foreground objects. However unimodal background modeling is not usually capable of capturing background dynamics. Baf et al. in  present a fuzzy statistical method for background subtraction to incorporate uncertainty into the mixture of Gaussians. However, this method requires a predefined number of components making this approach to be application dependent. Elguebaly and Bouguila in  propose a finite asymmetric generalized Gaussian mixture model for object detection. Again this method requires a predefined maximum number of components, presenting therefore limitations when this technique is applied on uncontrolled environments. Dai et al. in  propose a method for pedestrian detection and tracking using infrared imagery. This method consists of a background subtraction technique that exploits a two-layer representation (one for foreground and one for background) of infrared frame sequences. However, the assumption made is that the foreground is restricted to moving objects, a consideration which is not sufficient for dynamically changing environments. One way to handle the aforementioned difficulties is to introduce a background model, the parameters and the structure of which are directly estimated from the data, while at the same time it takes into account the specific properties of infrared imagery.
1.1 Our contribution
This work presents background modeling able to provide a per pixel density estimate, taking into account the special characteristics of infrared imagery, such as low signal-to-noise ratio. Our method exploits a Gaussian mixture model with unknown number of components. The advantage of such a model is that its own parameters and structure can be directly estimated from the data, allowing dynamic model adaptation to uncontrolled and changing environments.
An important issue in the proposed Gaussian mixture modeling concerns learning the model parameters. In our method, this is addressed using a variational inference framework to correspond the functional structure of the model with real data distributions obtained from the infrared images. Then, the Expectation-Maximization (EM) algorithm is adopted to fit the outcome of variational inference to real measurements. Updating procedures are incorporated to allow dynamic model adaptation to the forthcoming infrared data. Our updating method avoids of using heuristics by considering existing knowledge accumulated from previous data distributions and then it compensates this knowledge with current measurements.
Our overall strategy exploits a Bayesian framework to estimate all the parameters of the mixture model and thus avoiding over/under fitting issues. To compensate computational challenges arising from the non a priori known nature of the mixture model, we utilize conjugate priors and thus we derive analytical equations for model estimation. In this way, we avoid the need of any sampling method, which are computationally and memory inefficient. This selection makes our system suitable for online and real-time applications.
This paper is organized as follows: Section 2 formulates the Bayesian framework for mixture modeling. In Section 3 we present the procedure to analytically derive the solutions for estimating the distributions of model parameters. Section 4 describes the EM algorithm along with setting the priors and parameters initialization. Section 5 discusses the online updating mechanism and Section 6 presents the background subtraction task. In Section 7 experimental results are given and Section 8 concludes this work.
2 Gaussian mixture modeling
In this section we formulate the Bayesian framework adopted in this paper to analytically estimate all the parameters of the proposed Gaussian mixture model. For this reason, in section 2.1 we briefly describe the basic theory behind Gaussian mixture model, while in section 2.2 we describe the introduction of conjugate priors that assist us in yielding analytical model estimations as in Section 3.
2.1 Model fundamentals
The Gaussian mixture distribution can be seen as a linear superposition of Gaussian functional components,
where the parameters must satisfy for every and and is the number of Gaussian components. In the proposed mixture modeling, variable can take any natural value up to infinity. However, it is highly recommended to set the upper bound for less than the cardinality of the dataset, i.e. the number of observed samples. By introducing a -dimensional binary latent variable , such as and , the distribution can be defined in terms of a marginal distribution and a conditional distribution as follows
where and and are in the form of
where and , correspond to the mean values and precisions of Gaussian components. By introducing latent variables and transforming the Gaussian mixture distribution into the form of (2), we are able to exploit the EM algorithm for fitting our model to the observed data, as shown in Section 4.
2.2 Conjugate priors
To avoid computational problems in estimating the parameters and the structure of the proposed Gaussian model, we introduce conjugate priors, over the model parameters , and , that allow us to yield analytical solutions. This way the need of using sampling methods is prevented. Introduction of priors implies the use of a Bayesian framework for the analysis.
Let us denote as the set which contains all model latent variables and parameters and as its distribution. Then, our goal is to estimate which maximizes model evidence .
Due to the fact that is a Multinomial distribution, its conjugate prior is a Dirichlet distribution over the mixing coefficients
where is the Gamma function. Parameter has a physical interpretation; the smaller the value of this parameter is, the larger is the influence of the data rather than the prior on the posterior distribution . In order to introduce uninformative priors and not prefer a specific component against the other, we choose to use a single parameter for the Dirichlet distribution, instead of a vector with different values for each mixing coefficient.
Similarly, the conjugate prior of (4
) takes the form of a Gaussian-Gamma distribution, because bothand
are unknown. Subsequently, the joint distribution of parametersand can be modeled as
where denotes the Gamma distribution.
In order to not express any specific preference about the form of the Gaussian components through the introduction of priors, we use uninformative priors by setting the values of hyperparameters, , and to appropriate values as shown in Section 4.
Having defined the parametric form of observed data, latent variables and parameters distributions, our goal is to approximate the posterior distribution and the model evidence , where is the set with distribution , which contains all model latent variables and parameters. Based on the Bayes rule, which states that the logarithm of distribution can be expressed as
is the Kullback-Leibler divergence betweenand distributions and is the lower bound of . Since is a non negative quantity, equals to zero only if is equal to , maximization of is equivalent to minimizing of . For minimizing and estimating we exploit the EM algorithm, as shown in Section 4.
By making the assumption, based on variational inference, that the distribution can be factorized over disjoint sets such as , as shown in , the optimal solution corresponds to the minimization of is given by
where is the expectation of the joint distribution over all variables for and is a constant. is modeled through (12).
In the following, we present the analytical solution for the optimal distributions for model parameters and latent variables, i.e. the optimal distributions , , and .
3 Optimal model parameter distributions
), the joint distribution of all random variables can be factorized as follows
corresponds to the set of the observed variables. All proofs are given in Appendix A.
3.1 Optimal distribution
3.2 Optimal distribution
3.3 Optimal distribution
Similarly, the distribution of the optimized factor
is given by a Gaussian distribution of the form
where the parameters and are given by
where is equal to represents the centroid of the data that belong to the -th component.
3.4 Optimal distribution
After the estimation of , distribution of the optimized factor is given by a Gamma distribution of the following form
while the parameters and are given by the following relations
4 Distribution parameters optimization
After the approximation of random variables distributions, we will use the EM algorithm in order to find optimal values for model parameters, i.e. maximize (10). In order to use the EM algorithm, we have to initialize priors hyperparameters , , , and and the model parameters , , , , , and (see Section 3).
The parameter can be interpreted as the effective prior number of observations associated with each component. In order to introduce an uninformative prior for , we set the parameter equal to , suggesting that the same number of observations is associated to each component. Parameters and (positive values due to Gamma distribution) were set to the value of . Our choice is justified by the fact that the results of updating equations (19a) and (19b) are primarily affected by the data and not by the prior when the values for and
are close to zero. The mean values of the components are described by conditional Normal distribution with meansand precisions . We introduce an uninformative prior by setting the value for to the mean of the observed data and the parameter , where
is the variance of the observed data.
The convergence of EM algorithm is facilitated by initializing the parameters , , and
using the k-means. To utilize k-means, the number of clusters, corresponding to the Gaussian components, should be a priori known. Since we create a mixture model, the number of Gaussian components should be less or equal to the number of observed data. For this reason we set the number of clustersto a value smaller or equal to the number of observations. If we have no clue about the number of classes we can set to equal . If we denote as the number of observation that belong to -th cluster, then we can set the value of parameter to equal the centroid of -th cluster, the parameter to equal the proportion of observations for the -th cluster, the parameter to equal , where stands for the variance of the data of the -th cluster and the parameter to equal . Having initialized the parameters , , and , we can exploit the formula for the expected value of a Gamma distribution to initialize the parameters and to values and one respectively. Finally, the initialization of allows us to initialize the parameter , which can be interpreted as the effective number of observations associated with each Gaussian component, to the value .
After the initialization of model parameters and priors hyperparameters, the EM algorithm can be used to minimize of (10). During the E step, is calculated given the initial/current values of all the parameters of the model. Using (13b) is given by
Due to the fact that is a Dirichlet distribution and is a Gamma distribution, and will be given by
where is the digamma function.
During the M step, we keep fixed the value for variables (the value that was calculated during the E step), and we re-calculate the values for model parameters using (15), (17) and (19). The steps E and M are repeated sequentially untill the values for model parameters are not changing anymore. As shown in  convergence of EM algorithm is guaranteed because bound is convex with respect to each of the factors , , and .
During model training the mixing coefficient for some of the components takes value very close to zero. Components with mixing coefficient less than are removed (we require each component to model at least one observed sample) and thus after training the model has automatically determined the right number of Gaussian components.
5 Online updating mechanism
Having described how our model fits to observed data, in this section we present the mechanism that permits our model to automatically adapt to new observed data. We use no heuristic rules but statistics based on the observed data.
Let us denote as a new observed sample. Then, there are two cases; either the new observed sample is successfully modeled by our trained model, or not. To estimate if a new sample is successfully modeled, we find the closest component to the new sample. As a distance metric between components and the new sample, we use the Mahalanobis distance, since this is reliable distance measure between a point and a distribution.
The closest component to the new sample is the one that presents the minimum Mahalanobis distance
The probability of the new sample to belong to
where and stand for the closest component mean value and precision respectively.
Let us denote as the initially observed dataset. Then, we can assume that the probability to observe the new sample is given by
is a Uniform distribution with lower and upper bounds to equaland respectively. Equation (24) suggests that the probability to observe is related to the proportion of data that have already been observed around . By increasing the neighborhood around , i.e. increasing the value of , the quantity is decreasing, while the value of is increasing.
Upon arrival of a new sample , we can estimate the optimal range around that maximizes (24) as
Then, if the new observed sample can sufficiently represented by our model. Otherwise, a new Gaussian component must be created.
For model updating, we need to define the binary variable, called ownership and associated with the Gaussian components, as
where we recall that represents the index of the closest component and is the index of -th component.
When the new observed sample is successfully modeled, the parameters for the Gaussian components are updated using the following the leader  approach described as
where is equal to .
When the new observed sample cannot be modeled by the existing components, a new component is created with mixing coefficient , mean value, the parameters of which are given as
From (28), we see that the mixing coefficient for the new component is equal to since it models only one sample (the new observed one), its mean value equals the value of the new sample and its variance the variance the Uniform distribution, whose the lower and upper bounds are and respectively. When a new component is created the values for the parameters for all the other components remain unchanged except mixing coefficients , which are normalized to sum . After each adaptation of the system to new observed samples, either they modeled by the trained model or not, the mixing coefficients of the components are normalized to sum to one.
Figure 2 presents the adaptation of our model (first row) and the model presented in  (second row) to new observed data. To evaluate the quality of the adaptation of the models, we used a toy dataset with 100 observations. Observed data were generated from two Normal distributions with mean values 16 and 50 and standard deviations 1.5 and 2.0 respectively. The initially trained models are presented in the left column. Then, we generated 25 new samples form a Normal distribution with mean value 21 and standard deviation 1.0. Our model creates a new component and successfully fits the data. On the contrary, the model of  is not able to capture the statistical relations of the new observations and fails to separate the data generated from distributions with mean values 16 and 21 (middle column). The quality of our adaptation mechanism becomes more clear in the right column, which presents the adaptation of both models after 50 new observations.
|Algorithm 1: Overview of Background Subtraction|
|1: capture frames|
|2: create -length history for each pixel|
|3: initialize parameters (see Section 5)|
|4: until convergence (training phase: Section 4)|
|5: compute using (20)|
|6: recompute parameters using (15), (17) and (19)|
|7: for each new captured frame|
|8: classify each pixel as foreground or background|
|(see Section 6)|
|9: update background model (see Section 5)|
6 Background subtraction
In this section we utilize our model for background subtraction. We initially capture frames used to create an infrared responses history for each pixel. These histories act as observed data and used to train a model for each pixel. To classify a pixel of a new captured frame as background or foreground, we compute the probability its value to be represented by the mixture model. If this value is larger than a threshold the pixel is classified as background, otherwise it is classified as foreground. The threshold can be defined in relation to the parameters of the mixture, in order to be dynamically adapted. For example, we can define the threshold to be equal to where
is a scalar defining a confidence interval. The overview of the background subtraction algorithm is shown in Algorithm 1.
7 Experimental results
For evaluating our algorithm, we used the Ohio State University (OSU) thermal datasets and a dataset captured at Athens International Airport (AIA) during Evacuate111http://www.evacuate.eu/ European funding project. OSU datasets contain frames that have been captured using a thermal camera and have been converted to grayscale images. In contrast, the AIA dataset contains raw thermal frames whose pixel values correspond to the real temperature of objects.
OSU datasets [7, 8, 9] are widely used for benchmarking algorithms for pedestrian detection and tracking in infrared imagery. Videos were captured under different illumination and weather conditions. AIA dataset was captured using a Flir A315 camera at different Airside Corridors and the Departure Level. Totally, 10 video sequences were captured, with frame dimensions pixels of total duration 32051 frames, at 7.5fps, that is, about 1h and 12mins. During experimentation process we used the sequence captured at the Departure Level Entrance 3 which provides a panoramic view of the space. The other sequences at corridors, due to narrow space perspective and the fact that videos were captured by mounting the camera at human height level, are inappropriate for testing background subtraction algorithms.
We compared our method with the method presented by Zivkovic in  (MOG), which is one of the most robust and widely used background subtraction technique, and with the method for extracting the regions of interest presented in [8, 9] (SBG) used for thermal data. To conduct the comparison we utilized the objective metrics of recall, precision and F1 score on a pixel wise manner. Figures 3 visually present the performance of the three methods. As is observed, our method outperforms both MOG and SBG on all datasets. While MOG and SBG perform satisfactory on grayscale frames of OSU datasets, their performance collapses when they applied on AIA dataset, which contains actual thermal responses. Regarding OSU datasets, MOG algorithm while presents high precision it yields very low recall values, i.e. the pixels that have been classified as foreground are indeed belong to the foreground class, but a lot of pixels that in fact belong to background have been misclassified. SBG algorithm seems to suffer by the opposite problem. Regarding AIA dataset, our method significantly outperforms both methods. In particular, while MOG and SBG algorithms present relative high precision, their recall values are under 0.2. Figure 4(a) presents average precision, recall and F1 score per dataset and per algorithm for all frames examined to give an objective evaluation. In Figure 4(b) presents the best and worst case in terms of precision, recall and F1 score among all frames examined.
Regarding computational cost, the main load of our algorithm is in the implementation of EM optimization. In all experiments conducted, the EM optimization converges within 10 iterations. Practically, the time required to apply our method is similar to the time requirements of Zivkovic’s method making it suitable for real-time applications.
This paper presents a background subtraction method applicable to thermal imagery, based on Gaussian mixture modeling with unknown number. We analytically derive the solutions that describe the parameters of the model and we use the EM optimization to estimate their values, avoid sampling algorithms and high computational cost. Due to its low computational cost and the real-time operation, our method is suitable for real-world applications.
Appendix A Appendix
Due to the fact that there is no term in (31b) that contains parameters from both sets and , the distribution can be factorized as . The distribution for is derived using only those terms of (31b) that depend on the variable . Therefore the logarithm of is given by
We have made use of , and we have denote as . (32c) suggests that is a Dirichlet distribution with hyperparameters .
Using only those terms of (31b) that depend on variables and , the logarithm of is given by
For the estimation of , we use (33) and keep only those factors that depend on .
where , and .
After the estimation of , logarithm of the optimized the distribution is given by
The parameters and are given by
-  C. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, Oct. 2007.
-  S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Mar. 2004.
-  S. Brutzer, B. Hoferlin, and G. Heidemann. Evaluation of background subtraction techniques for video surveillance. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1937–1944, June 2011.
-  S.-C. S. Cheung and C. Kamath. Robust background subtraction with foreground validation for urban traffic video. EURASIP Journal on Advances in Signal Processing, 2005(14):726261, Aug. 2005.
-  C. Dai, Y. Zheng, and X. Li. Pedestrian detection and tracking in infrared imagery using shape and appearance. Computer Vision and Image Understanding, 106(2–3):288–299, May 2007.
-  S. Dasgupta and D. Hsu. On-line estimation with the multivariate gaussian distribution. In Proceedings of the 20th Annual Conference on Learning Theory, COLT’07, pages 278–292, Berlin, Heidelberg, 2007. Springer-Verlag.
-  J. Davis and V. Sharma. Fusion-based background-subtraction using contour saliency. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, 2005. CVPR Workshops, pages 11–11, June 2005.
-  J. W. Davis and V. Sharma. Robust background-subtraction for person detection in thermal imagery. In Proceedings of the 2004 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW’04) Volume 8 - Volume 08, CVPRW ’04, pages 128–, Washington, DC, USA, 2004. IEEE Computer Society.
-  J. W. Davis and V. Sharma. Background-subtraction in thermal imagery using contour saliency. International Journal of Computer Vision, 71(2):161–181, Feb. 2007.
-  F. El Baf, T. Bouwmans, and B. Vachon. Fuzzy statistical modeling of dynamic backgrounds for moving object detection in infrared videos. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009. CVPR Workshops 2009, pages 60–65, June 2009.
-  A. Elgammal, D. Harwood, and L. Davis. Non-parametric model for background subtraction. In D. Vernon, editor, Computer Vision — ECCV 2000, number 1843 in Lecture Notes in Computer Science, pages 751–767. Springer Berlin Heidelberg, Jan. 2000.
-  T. Elguebaly and N. Bouguila. Finite asymmetric generalized gaussian mixture models learning for infrared object detection. Computer Vision and Image Understanding, 117(12):1659–1671, Dec. 2013.
-  R. Gade, A. Jorgensen, and T. Moeslund. Long-term occupancy analysis using graph-based optimisation in thermal imagery. In 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3698–3705, June 2013.
-  T. Haines and T. Xiang. Background subtraction with DirichletProcess mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):670–683, Apr. 2014.
-  S. Herrero and J. Bescós. Background subtraction techniques: Systematic evaluation and comparative analysis. In Proceedings of the 11th International Conference on Advanced Concepts for Intelligent Vision Systems, ACIVS ’09, pages 33–42, Berlin, Heidelberg, 2009. Springer-Verlag.
-  K. Jungling and M. Arens. Feature based person detection beyond the visible spectrum. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009. CVPR Workshops 2009, pages 30–37, June 2009.
-  L. Latecki, R. Miezianko, and D. Pokrajac. Tracking motion objects in infrared videos. In IEEE Conference on Advanced Video and Signal Based Surveillance, 2005. AVSS 2005, pages 99–104, Sept. 2005.
-  K. Makantasis, A. Doulamis, and N. Matsatsinis. Student-t background modeling for persons’ fall detection through visual cues. In 2012 13th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pages 1–4, May 2012.
-  N. J. B. McFarlane and C. P. Schofield. Segmentation and tracking of piglets in images. Machine Vision and Applications, 8(3):187–193, May 1995.
S. Messelodi, C. M. Modena, N. Segata, and M. Zanin.
A kalman filter based background updating algorithm robust to sharp illumination changes.In F. Roli and S. Vitulano, editors, Image Analysis and Processing – ICIAP 2005, number 3617 in Lecture Notes in Computer Science, pages 163–170. Springer Berlin Heidelberg, Jan. 2005.
-  F. Porikli. Achieving real-time object detection and tracking under extreme conditions. Journal of Real-Time Image Processing, 1(1):33–40, Mar. 2006.
-  C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on., volume 2, pages –252 Vol. 2, 1999.
-  K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: principles and practice of background maintenance. In The Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999, volume 1, pages 255–261 vol.1, 1999.
-  O. Tuzel, F. Porikli, and P. Meer. Human detection via classification on riemannian manifolds. In IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR ’07, pages 1–8, June 2007.
-  O. Tuzel, F. Porikli, and P. Meer. Pedestrian detection via classification on riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1713–1727, Oct. 2008.
-  W. Wang, J. Zhang, and C. Shen. Improved human detection and classification in thermal images. In 2010 17th IEEE International Conference on Image Processing (ICIP), pages 2313–2316, Sept. 2010.
-  C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785, July 1997.
-  J. Zheng, Y. Wang, N. Nihan, and M. Hallenbeck. Extracting roadway background image: Mode-based approach. Transportation Research Record: Journal of the Transportation Research Board, 1944:82–88, Jan. 2006.
-  Z. Zivkovic. Improved adaptive gaussian mixture model for background subtraction. In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004, volume 2, pages 28–31 Vol.2, Aug. 2004.
-  Z. Zivkovic and F. van der Heijden. Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters, 27(7):773–780, May 2006.