High level computer vision applications, ranging from video surveillance and monitoring to intelligent vehicles, utilize visible spectrum information. However, under certain environmental conditions, this type of sensing can be severely impaired. This emerges the necessity for imaging beyond the visible spectrum, exploiting thermal sensors, which are equally applicable both for day and night scenarios, while at the same time are less affected by illumination changes.
However, thermal sensors present their own unique challenges. First, they have low signal-to-noise ratio (SNR) implying the existence of noisy data. Second, there is a lack of color and texture information deteriorating visual content interpretation . Third, objects are not thermally homogeneous and are exposed to variations of sun illumination . All these issues complicate pixel modeling especially when applying for object categorization and foreground/background detection.
For many high level applications, either they use visual-optical videos [3, 4] or thermal data [1, 5, 6, 7], the task of background subtraction constitutes a key component for locating moving objects . The most common approach to model the background is to use mixtures of Gaussians, the number of which is assumed to be a priori known. While such assumption is valid for sensors capturing the visible spectrum, mainly due to their ultra high resolution accuracy, that is, high SNR, they are inappropriate for thermal data. Selection of a large number of components results in modeling the noise and therefore it reduces discrimination performance. On the contrary, low number of components yields approximate modeling that fails to capture the complexity of a scene. Consequently, methods that automatically estimate the most suitable components to fit the statistics of thermal data are important towards an efficient background subtraction scheme.
Furthermore, to increase the penetration of thermal sensors to the surveillance market, embedded acceleration methods are needed. This means that the background subtraction algorithms should be re-designed to be implemented in low power devices. This way, the benefits are twofold. First, we offload a significant computation load near the source of the data, and thus, bandwidth is saved as only the region of interest (or an event) is transmitted. Second, costly operations are executed in low-power embedded hardware saving valuable resources.
I-a Related Work
Background subtraction techniques applied on visual-optical videos model the color properties of objects 
and can be classified into three main categories: basic background modeling, statistical background modeling and background estimation. The most used methods are the statistical ones due to their robustness to critical application scenarios.
One common approach for the statistical modeling of visual content is based on the exploitation of Gaussian Mixture Models (GMMs). In the work of GMMs are utilized to create space-time representations in video segment level. However, the specific task of background subtraction requires the estimation of a pixel level background model.
proposes a Student-t mixture model improving compactness and robustness to noise and outliers. The work of proposes a background subtraction algorithm based on GMMs with only one user-tunable parameter. In  a GMM-based background modeling that incorporates incremental EM type of learning is proposed in order to improve convergence speed and stability of learnt models. These works assume that the number of components is a priori known. Following this assumption, the intensities of all pixels are represented using GMMs, all of the same number of components. However, in many real world applications this assumption can be very restrictive, because the intensities of different pixels may need different number of components for being accurately modeled.
The works of  and  extend the method of  by introducing a user defined threshold to estimate the number of components. However, this rule is application dependent and not directly derived from the data. Another extension of  is proposed in , where a patch-wise background model is estimated using dynamic texture. However, patch level decisions may produce very coarse results when low resolution videos, such as the thermal ones, need to be processed.
An alternative approach that makes no a priori assumptions on the number of components is presented in the work of . This work proposes a recursive density approximation method that relies on the propagation of density modes, which are detected by using the mean shift. Although, mean shift is a great nonparametric technique, it is computationally intensive and thus not suitable for low power hardware implementations.
The work of  proposes the exploitation of a Dirichlet Process Mixture Model (DPMM). This method automatically estimates the number of components by utilizing sampling techniques. However, sampling algorithms are computational intensive and memory inefficient and thus inappropriate for real-time applications, as the ones we focus on. To address this problem, the authors of  suggest a GPU implementation.
Another approach for modeling data distributions using GMMs is presented in . This work focuses on the general problem of density estimation. Based on the property of GMMs to approximate arbitrarily close any distribution, it estimates the true distribution by creating a sufficiently large number of components, which may be very ”close”. For density estimation problem, creating such components is not an issue since it may increase approximation accuracy. However, when one needs to design and develop an algorithm for low power hardware devices, as in our case, then this algorithm should keep in memory as few as possible parameters.
Finally, when someone knows some important features of foreground and/or background objects, supervised learning techniques can be utilized. For example in the work of a supervised learning approach is proposed for discriminating fire than the background. However, in the general problem of background subtraction it is very uncommon to known some specific features of foreground and/or background in advance.
Techniques that use visual/optical data present the drawback that objects’ properties are highly affected by scene illumination, making the same object to look completely different under different lighting or weather conditions. Although, thermal imagery can provide a challenging alternative for addressing this difficulty, there exist few works for thermal data.
The authors of [23, 2] exploit contour saliency and a unimodal background modeling technique to extract foreground objects. However, unimodal models are not usually capable of capturing the background dynamics and its complexity. Baf et al. in  present a fuzzy statistical method for background subtraction to incorporate uncertainty into the GMM. Elguebaly and Bouguila in  propose a finite asymmetric generalized Gaussian mixture model for object detection. However, both of these methods require a predefined maximum number of components, presenting therefore limitations when they are applied on uncontrolled environments.
Dai et al. in  propose a method for pedestrian detection and tracking using thermal imagery. This method consists of a background subtraction technique that exploits a two-layer representation (foreground/background) of frame sequences. However, they assume that the foreground is restricted to moving objects, a consideration which is not sufficient for dynamically changing environments. One way to handle the aforementioned difficulties is to introduce a background model, the parameters and the structure of which are directly estimated from the data, while at the same time it takes into account the specific characteristics of thermal imagery.
The computational cost, and thus the performance, of a background subtraction algorithm is always an issue as it usually performs poor in CPUs. One of the first attempts for real-time performance was the work of  implemented in SGI O2 computer. Since then, many implementations in GPUs were proposed. In  and  implementations based on the model of  achieving real-time performance even for High-Definition (HD) resolutions are presented. In the work of  the authors managed to accelerate the algorithm of  up to 1080p resolution of 60fps. However, GPUs cannot be considered low power devices, which can be seen as a disadvantage especially for 24/7 surveillance systems.
This gap is addressed from Field Programmable Gate Arrays (FPGA) accelerators. In the work of  and , a real-time video surveillance system using a GMM, that also handles memory bandwidth reduction requirements, is proposed. Other approaches such as the work of  propose accelerators of the GMM algorithm in reconfigurable hardware reaching 24fps for HD video. The same authors claim even better performance of 91fps in HD video in their improved work of  if a Xilinx Virtex 4 device is used. However, the main limitation to achieve this performance is the memory bandwidth which becomes the main bottleneck in the pipeline. The requested bandwidth for this performance is about 8GB/sec where FPGA boards usually hold 64bit-wide Dynamic Random Access Memory (DRAM) clocked in a range of 100-200 MHz. As a result the memory subsystem can support at least one order of magnitude lower bandwidth. This means that we need technologies for reducing memory requirements in case that the background subtraction algorithm is adapted to be implemented under reconfigurable hardware architectures.
I-B Our Contribution
The main contribution of this paper is the design of a background subtraction system (as a whole) that is completely data driven, takes into account the specific characteristics of thermal imagery and is suitable for implementation in low power and low memory hardware.
. Our method exploits GMMs with unknown number of components, which are dynamically estimated directly from the data. In particular, the Variational Inference framework is adopted to associate the functional structure of the model with real data obtained from thermal images. Variational Inference belongs to the class of probability approximation methods, which try to estimate the approximate posterior distribution by minimizing the KL-divergence between the approximate and the true posteriors. As it has been shown in and , when the number of samples tends to infinity the lower variational bound approaches the BIC criterion for model selection. When someone targets low power hardware devices, and thus must keep in memory as few as possible parameters, this feature is very important, since through Variational Inference the true distribution is approximated without over/under fitting (creates the ”right” number of components).
The adopted approach, instead of treating the mixing coefficients of the components as single parameters, it considers them as probability distributions. Under this framework, we need to estimate forms of probability distributions that best fit data properties, instead of fitting an a priori known number of components to the captured data. Then, the Expectation-Maximization (EM) algorithm is adopted to estimate model parameters. To compensate computational challenges we utilize conjugate priors for deriving analytical equations for model estimation. Updating procedures are incorporated to allow dynamic model adaptation. Our updating method avoids the use of accumulated data from previous time instances, resulting in low memory requirements. Such a scheme assists the implementation of an in-camera module suitable for devices of low power and memory demands.
This paper is organized as follows: Section II introduces the Variational Inference framework, while Section III describes the algorithm for optimally estimating the model parameters. In Section IV, we present the EM optimization that best fits model parameters to the data. A threshold independent on-line updating algorithm is introduced in Section V. The in-camera reconfigurable architecture is discussed in Section VI-B, while experimental results are presented in Section VII.q Finally, Section VIII draws the conclusions of the paper.
Ii Variational Inference for Gaussian Mixture Modeling
Ii-a Gaussian Mixture Model Fundamentals
The Gaussian mixture distribution has the following form;
where and and are in the form of
represents the Gaussian distribution,is the number of components, variables refer to the mixing coefficients that represent the proportion of data that belong to each component and which satisfy and . Variable corresponds to the intensity of a pixel (i.e., the observed variable) and , stand for the mean values and precisions of the Gaussian components respectively. The
-dimensional vectoris a binary latent variable in which a particular element is equal to one and all other elements are equal to zero, such as and . This vector is related to the number of components that are used for modeling pixels intensities. In the work of  the value of is assumed to be a priori known, while in our case, this value is estimated directly from the data.
Eq.(1) models the effect of one sample. Given a set of pixel intensities (i.e., observed data), we conclude to a set of latent variables, . Each is a -dimensional binary vector with one element equals one and all the others equal zero, such as . Then, Eq.(2) and Eq.t(3) are transformed to
The goal is to estimate a background model exploiting pixel intensities, that is, to calculate the values of , and , involved in the probability .
Ii-B Distribution Approximation through Variational Inference
, which exploit the k-means algorithm. For many real-life application scenarios, as the one this paper targets, it is better to let variablefit the statistics of the data (i.e., let variable be unknown). In such cases, one way to estimate is to apply computationally expensive methods through sampling algorithms or to build several models of different number of components and then select the best one. An alternative computationally efficient approach, adopted in this paper, is to exploit the Variational Inference framework. More specifically, instead of treating the mixing coefficients as single parameters, which requires the knowledge of , we treat them as probability distributions. This way, we are able to estimate the coefficients independently from . Such an approach keeps computational complexity low since it avoids sampling methods or experimentation on different number of components.
Let us denote as a set which contains model parameters and the respective latent variables. Let us also denote as the variational distribution of . Our objective is to estimate to be as close as possible to for a given observation
. Regarding similarity between two distributions, the Kullback-Leibler divergence,
is used. has to be minimized since it is a non negative quantity, which equals zero only if .
In the context of the most common type of Variational Inference, known as mean-field approximation, the variational distribution is assumed to be factorized over disjoint sets such as . Then, as shown in , the optimal solution that minimizes metric is given by
is the expectation of the logarithm of the joint distribution over all variables that do not belong to thepartition and is a constant. Eq.(7) indicates the presence of circular dependencies between the variables that belong to different partitions. Thus, estimating the optimal distribution over all variables suggests the exploitation of an iterative process such as the EM algorithm (see Section IV).
Iii Optimal distributions over model parameters
In this section, we present the analytical form for the optimal distributions , considering the model coefficients and the latent variables; i.e., , , and . For simplifying the notation, in the following the superscript of optimal distributions and the subscript for the partition are omitted.
Iii-a Factorized Form of the Joint Distribution
To estimate , we require to rewrite the right hand of Eq.(7), that is, the joint distribution , as a product
The distributions and are already known from Eq.(5) and Eq.(4). Thus, we need to define the prior distribution and the joint distribution . In this paper, conjugate priors are adopted to estimate the distributions and . Such an approach is computational efficient since it avoids implementation of the expensive sampling methods yielding analytical solutions.
We start our analysis by the prior distribution . In particular, since has the form of a multinomial distribution, [see Eq.(4)], its conjugate prior is given by
Eq.(9) is a Dirichlet distribution with stands for the Gamma function and scalar a control parameter. The smaller the value of is, the larger the influence of the data rather than the prior on the posterior distribution . The choice of setting the parameter as a scalar instead of a vector of different values for each mixing coefficient, is due to the fact that we adopt an uninformative prior framework, that is not preferring a specific component against the others.
) takes the form of a Gaussian-Gamma distribution since bothand are unknown. Subsequently, the joint distribution can be modeled as
where denotes the Gamma distribution. Again, an uninformative prior framework is adopted meaning that no specific preference about the form of the Gaussian components is given. The parameters , , and are discussed in Section IV-C. In the following, the forms of optimal variational distributions are presented using the results from Appendix A.
Iii-B Optimal Distribution
Iii-C Optimal Distribution
Iii-D Optimal distribution
Similarly, the variational distribution of the optimized factor is given by a Gaussian distribution of the form
where the parameters and are given by
Variable is equal to and represents the centroid of the data that belong to the -th component.
Iii-E Optimal distribution
After the estimation of , the variational distribution of the optimized factor is given by a Gamma distribution of the following form
while the parameters and are given as
Iv Distribution Parameters optimization
In Section III
, we derive approximations of the random variable distributions. While the works of and  adopt k-means algorithm to approximately estimate the parameters of the background model, in this work, the EM algorithm is employed to optimally estimate the coefficient distributions that best fit the observations.
Iv-a The EM Optimization Framework
E-Step: Let us assume the -th iteration of the EM optimization algorithm. Then, during the E-step, only the value of is readjusted according to the statistics of the currently available observed data. Variable actually expresses the degree of fitness of the -th datum to the -th cluster, as derived from Eq.(11). Due to the fact that is a Dirichlet distribution and is a Gamma distribution, the following holds
where is the digamma function. In Eq.(18), we set and to simplify the notation of the following equations. Then,
M-Step: During the M-step, we keep fixed the value of , as it has been calculated through the E-Step. Then, we update the values of the background model coefficients, which will allow us to re-estimate the degree of fitness at the next iteration stage, exploiting Eq.(19).
The distribution of the model coefficients is computed based on Eq.(13). The value is given in Section IV-C. We recall that in our approach, the number of components that the background content is composed to is not a priori known. For this reason, the mixing coefficients of the background model are treated as probability distributions and not as single parameters. Due to this fact, we can initialize the EM algorithm by setting the number of components to be smaller than or equal to a maximum value, coinciding with the number of observed data, that is, . Then, the probability coefficients distributions re-arrange the number of components, in order to best fit the statistics of the observations. This is achieved through EM optimization.
Next, the distribution is updated exploiting . In order to do this, we need to update and based on Eq.(15). The E and M steps are repeated sequentially until the values for model parameters are not significantly changing. As shown in  convergence of EM algorithm is guaranteed because bound is convex with respect to each of the factors , , and .
During model training the mixing coefficient for some of the components takes value very close to zero. Components with mixing coefficient less than are removed (we require each component to model at least one observed sample) and thus after training, the model has automatically determined the right number of Gaussian components.
Iv-B Initialization Aspects
The k-means++ algorithm  is exploited to initialize the EM algorithm at . The k-means++ presents advantages compared to the conventional k-means, since it is less depended on initialization. It has to be mentioned that fuzzy versions of k-means are not suitable for the initialization process, since each sample should belong to exactly on cluster/component. The k-means++ creates an initial partition of the data used to initialize EM algorithm. Then, at the updating stages of the algorithm (Section IV-A), the probabilities of each observed datum to belong to one of the available clusters, expressed through , are updated. This way, the final number of clusters are dynamically refined according to the statistical distributions of the image pixel intensities.
Let us denote as the number of observations that belong to -th cluster at iteration . Then, an initial estimate of the mixing coefficients is , meaning that the significance of the -th component is proportional to the number of data that belong to the -th cluster. Thus, the initialization of , [see Eq.(13)] expresses the number of observations associated with each component. The parameters , , and are initially estimated from Eq.(17) and Eq.(15), considering the knowledge of the priors parameters as discused in Section IV-C. Finally, the model parameter
is given as inverse proportional of the variance of the data of the-th initial cluster, that is, .
Iv-C Priors Parameters
The parameter in Eq.(9) can be interpreted as the effective prior number of observations associated with each component. However, we do not have any prior information regarding this number. In order to use an uninformative prior and maximize the influence of the data over the posterior distribution we set , see .
Relations Eq.(17) and Eq. (17b) suggest that the values of parameters and are primarily affected by the data and not by the prior, when the values of the parameters and are close to zero. For this reason we set and to a very small value ( in our implementation).
Similarly, we initialize as the mean value of the observed data and precision , where is the variance of the observed data. We use uninformative priors, since we do not have any information regarding neither the number of components nor their true mean and variance values.
V Online Updating Mechanism and Background Subtraction
Using the aforementioned approach, we fit a model to the background considering a pool of observed data. In this section, an adaptive strategy that is threshold-independent and memory efficient is presented. Such an approach permits implementation of the proposed algorithm to an in-camera embedded hardware of low power and memory requirements. This way we deliver thermal sensors embedding with the capability of detecting moving objects in real-time. Furthermore, by exploiting the updating mechanism the presented system can online process streams of frames yielding a small computational time. So, in a sense, it can handle big data volumes.
Let us denote as a new observed sample. Then, a decision is made whether can be approximated by our best fitted model or not. For this reason, the best matched Gaussian component to is estimated by minimizing the Mahalanobis distance ,
where and stand for the mean and precision of the -th component. We use Mahalanobis distance, since this is a reliable distance measure between a point and a distribution. Then, belongs to with probability
V-a Threshold Independent
Conventionally, Eq.(22) implies a threshold to determine the probability limit over which the new sample belongs to . To overcome threshold limitations, the following adaptive approach is adopted in this paper.
Let us denote as the image pixel responses over a fixed time span. Then, we model the probability to observe the new sample in a region of range centered at as
where is the cardinality of the set that contains samples -close to and
is a Uniform distribution with lower and upper bounds that equal toand . respectively.
Eq.(23) suggests that the probability to observe the is related to the portion of data that have been already observed around . By increasing the neighborhood around , i.e., increasing the value of , the quantity is decreasing, while the value of is increasing. Therefore, we can estimate the optimal range around that maximizes Eq. (23)
Based on the probabilities and , which are exclusively derived by the observations, we can define our decision making mechanism. Concretely, if
the new observed sample can sufficiently represented by our model, i.e., the value of the new observed sample is sufficiently close to an existing Gaussian component. Otherwise, a new Gaussian component should be created, since the value of will not be close to what the model has already learnt.
V-B Model Updating
When the value of the new observed sample is sufficiently close to an existing Gaussian component, the parameters of the mixture are being updated using the following the leader  approach described as
where is the variance of the
-th component. The binary variabletakes value one when and zero otherwise.
When the new observed sample cannot be modeled by any existing component, i.e. the value of the new sample will not be close to what the model has already learnt [see Eq.(25)], a new component is created with mixing coefficient , mean value, defined as
Variable is estimated using the variance of the Uniform distribution. From Eq.(27), we derive that since it models only one sample (the new observed one), its mean value equals the value of the new sample and its variance the variance of the Uniform distribution, whose the lower and upper bounds are and respectively. When a new component is created the values for the parameters for all the other components remain unchanged except from the mixing coefficients which are normalized to sum . Then, the components whose mixing coefficients are less than are removed and the mixing coefficients of the remaining components are re-normalized.
V-C Memory Efficient Implementation
The main limitation of the aforementioned threshold independent approach is that it requires the storage of several observations, in order to reliably estimate the probability . In this section, we introduce a framework of updating the model parameters without the need of storing observations. This reduces memory requirements, and thus, it is a crucial step towards implementing our proposed system on devices of low power and memory requirements.
We recall that we have denoted as the closest component, in terms of Mahalaobis distance, to the new observed datum . This component is a Gaussian distribution with mean value , precision and mixing coefficient . Therefore, the quantity can be approximated as
the cumulative Gaussian distribution of the closest Gaussian component and using Eq.(28), is equal to
Then, the probability is approximated as
Probability is a continuous and unimodal function. Therefore, can be found either by setting the first derivative of Eq.(31) equal to zero or by using a numerical approach according to which the epsilon value is increased using ”sufficiently” small steps in order to approximate the point where the curvature of Eq.(31) changes. This point indicates the optimal value of epsilon. After the estimation of , we can compute . Thus, we are able to update the mixture model by comparing to .
V-D Background Subtraction
Let us denote as and the classes of background and foreground pixels respectively. The aforementioned modeling process actually approximates the probability . However, our goal is to calculate the probability , in order to as foreground or background a set of observations. Hence the Bayes rule is applied;
Then, the foreground object is derived through a subtraction process. The unknown factors of Eq.(32) are and . The probability
corresponds to the prior probability of background class. In our case, we have set it to be larger than, since the number of pixels that belong to the background class is larger than the number of pixel that belong to the foreground class. The probability is modeled using a uniform distribution as in . Thus, at arbitrary value of is , since can take arbitrary integer values between 0 and 255. The overview of the proposed scheme is shown in Algorithm 1.
Following this approach, our method avoids outliers by assigning them to components with very low weight. This way outliers are not practically considered during background subtraction, since will be close to zero when is an outlier. Furthermore, by exploiting the proposed online adaptation mechanism, components assigned to outliers will be discarded after the capturing of a few new frames, because their weight will be smaller than .
|Algorithm 1: Overview of Background Subtraction|
|1: capture frames|
|2: create -length history for each pixel|
|3: initialize parameters (see Section IV)|
|4: until convergence (training phase: Section IV)|
|5: compute using (19)|
|6: recompute parameters using (13), (15) and (17)|
|7: for each new captured frame|
|8: classify each pixel as foreground or background|
|(see subection V-D)|
|9: update background model (see subection V-C)|
V-E Interesting Cases
V-E1 Branches sway in the wind
When, at some part of the scene, there are tree branches sway in the wind, the intensities of the pixels that depict this part will be clustered into two (or more) different clusters; intensities of the branches and intensities of the sky. In such cases, conventional methods that utilize a fixed number of components for representing the underlying data distribution may encounter serious problems. On the contrary, the proposed approach can estimate the number of components directly from the data. Therefore, the aforementioned clusters will be correctly identified and the corresponding pixels will be considered as background.
V-E2 Switching from background to foreground and vice versa
Consider the following scenario; a foreground object, let us say a pedestrian, appears at pixel at time and stays there, standing still, before leaving at time . Then, he/she returns back at the same location at time . We want to discuss what will be the behavior of the proposed method regarding times and . In order to provide a formal explanation, we have to employ i) the history of pixel’s intensities, ii) relations Eq.(25), Eq.(27), Eq.(28) and Eq.(33), iii) the parameter and iv) the threshold for considering a pixel to belong to the background. We consider that the length of pixel’s history equals , the equals and the threshold for considering a pixel to belong to the background equals .
Consider that a pedestrian appears at pixel at frame (assuming that we use a camera with constant fps rate, we measure time in frames and not in seconds, this way the analysis is camera independent). A new component will be created for that pixel. The new component will be initialized using Eq.(28). Then, the pedestrian stays there (standing still) for the next frames. The following figures depict the evolution of the output of Eq.(33) using different values for .
The axis in Fig.1 corresponds to the new captured frames. As we can see, if we use a value close to for , then the pixel will be considered as background after 32 frames. If we increase the value for then much fewer frames are needed for considering the same pixel as background, because the prior belief that a pixel belongs to the background class is larger.
Now consider that the pedestrian decides to leave this location. Then there are two different cases. In the first case, the pedestrian starts moving before he/she will be considered as background. In that case, the system will be correctly considering the pedestrian as foreground object for the whole period started at . In the second case, the pedestrian decides to leave frames later after the time when pixel considered background. Since the rate of increment of the component’s weight is the same as the rate of decrement, and due to the fact that , additional frames will be required before our system consider the pixel as background.
V-E3 Sensor noise and flickering intensities
The sensor noise, typically, is zero mean, Gaussian and additive. In this case the noise will slightly affect the intensity of the pixels around the mean value. Therefore, variations due to this kind of noise will be captured by the components of the proposed GMM. On the other hand, flickering of intensities and/or salt and pepper sensor noise will indeed result to consider individual pixels as foreground. In this case we remove foreground blobs whose area is smaller than a threshold. During the evaluation of the proposed method, this threshold was manually optimized for each dataset and this post processing step applied on all algorithms that our method is compared against.
Vi In-Camera Acceleration Architecture
In this section, we describe in detail the hardware architecture for the proposed background subtraction algorithm. We call the proposed parallel implementation as Background Subtraction Parallel System (BSPS). BSPS exploits the reconfigurable resources of today’s FPGA devices.
Vi-a High Level Architecture
Since the proposed approach makes no assumption for pixel relationships, the background model for all pixels can be computed independently. Thus, our hardware architecture consists of many parallel cores in a scalable configuration, each processing a single pixel. In Section VII we demonstrate two configurations; one low cost, featuring a 4-core BSPS Engine and a second one featuring a 16-core BSPS Engine.
Each of the cores is connected to a shared bus in order to get the processing data from the external DRAM (or memory mapped camera module) of a host system. The data loading is performed in batches of up to 16 pixels as shown in Fig.2.
All operations are per pixel with no dependencies between them. Thus, using a buffering scheme utilizing simple FIFOs, we can hide the latency of the external DRAM and make our scheme working seamlessly as a streaming accelerator. However, it has to be mentioned that the parallelization of the algorithm, or the overall performance in general, does not actually depend on the buffering scheme, which in our case prevents possible ”data starvation” from the outside. The operations regarding data loading and write-back are fully pipelined. More details regarding the bandwidth demands are given in Section VII. The output of each core is a probabilistic classification for the corresponding pixel (background or foreground) and the updated parameters of the background model.
Vi-B System Organization
The BSPS comprises of two basic sub-modules: the Model Estimation Unit (MEU), which is depicted in Fig.3 and the Background Subtraction Unit (BSU) depicted in Fig.4. The MEU is activated just once at the initialization process of the system. It is responsible for building the proposed background model at all pixel locations. It uses a small history of pixel values and automatically estimates the appropriate number of Gaussian components along with their mixing coefficients. After the model is built, the MEU stores the model parameters to the external DRAM for each one of the image pixels.
Then and during the normal operation of the system, only the BSU is activated. The BSU takes as input the pixel data stream along with the model parameters and gives as output the probabilistic segmentation of each pixel, while it also updates model parameters. This way for each pixel a background model is maintained and updated, which is utilized for the classification of all the new incoming pixels.
Vi-C The Model Estimation Unit
One of the key advantages of the proposed scheme is that it does not require any prior knowledge about the structure of the background model in order to achieve an optimal operation. The MEU, depicted in Fig.3, is responsible for this task. It builds an accurate model for the specific background and its inherent temporal characteristics. It takes as input a small history of pixel responses () at a specific pixel location and outputs an accurate background model for this pixel. As mentioned in Section IV, in this module two basic algorithms are utilized; the k-means++ and the EM algorithm. Around 100 frames correspond to 13 seconds of videos when the frame rate of the camera is 7.5Hz, a typical value for thermal cameras. It has to be mentioned that the presented algorithm does not require any reference background frame to operate properly. In case that foreground objects appear in these frames, due to the fact that they are moving objects, they will be modeled by components with very low weight and thus they will slightly affect the background estimation process. Furthermore, by employing the updating mechanism the model will be adapted to new frames and will discard the components that model foreground objects. The history of frames could have been chosen to include 150 or 200 or even more frames. However, increasing the length of the history increases the computational requirements for model initialization. Since this work presents a model for in-camera background subtraction, frames were chosen due to the fact that this number of frames is sufficient for describing the current dynamics of a scene and at the same time the computational cost for initializing the model is acceptable.
Vi-D The Background Subtraction Unit
The BSU, depicted in Fig.4, is responsible for classifying the incoming pixels into the two available classes (background and foreground) and also updating the background model according to the new pixel response. For this reason, BSU takes as input a new pixel value () and the current Gaussian mixture for this pixel, which is stored outside the chip, and gives as output the updated Gaussian mixture, as well as, the probabilistic classification of the incoming pixel.
During the background subtraction task the FIFO based scheme processes all pixels of one frame before proceed to next frame; from the same frame it loads parallel batches of pixels depending on the number of parallel cores on chip. This way, we can achieve lower latency during processing and also have lower buffering when accessing the camera sensor. On the contrary, during the initialization of the system, the FIFO based scheme processes for each pixel a history of intensities, since it is required for the parameter estimation task.
Vii Experimental Validation
Vii-a VI Mixture Model Fitting Capabilities
During experimental validation, we evaluate the Variational Inference Mixture Model (VIMM) in terms of fitting accuracy and computational cost. We compare VIMM with the GMM and DPMM. The GMMs are employed under two different settings; (i) with more and (ii) less components than the underlying distribution. For experimentation purposes, we create one synthetic dataset from three non overlapping Gaussian distributions. The initial value for the number of components for VIMM is set to . In order to compare our method with the conventional GMMs of fixed number of components, we create two Gaussian models of and components respectively. These numbers are arbitrarily chosen, since the correct number of components is not a priori known.
Fig. 5 presents the fitting performance of all models. Our method correctly estimates the number of components. The GMM with components under-fits the data, since the underlying distribution comes from Gaussians. The GMM with components also under-fits the data. This is very likely to happen when the underlying distribution of the data is simpler than the structure of the GMM. In such cases, the GMM is likely either to overfit the data by assigning some components to outliers or underfit the data by constructing several components to describe samples coming from the same Gaussian. The DPMM approach yields better results, since it is able to be adapted to the current data statistics, but it still under-fits the data. Table I presents time performance of the different models. All presented times were computed in Python and not in hardware implementation [see subsection VII-D].
Vii-B Updating Mechanism Performance
In this section we evaluate the quality of the proposed updating mechanism, with and without keeping in memory the observed data, and compare it against the updating mechanism presented in . The rationale behind the decision to explore both cases lies in the fact that we target special purpose hardware with very limited on-chip memory. In this respect, we have to validate that even without keeping the data in memory the algorithm performance is not affected at all.
Fig.6 presents the adaptation of all models. To evaluate the quality of the adaptation, we use a toy dataset with 100 observations. Observed data were generated from two Gaussian distributions with mean values 16 and 50 and standard deviations 1.5 and 2.0 respectively. The initially trained models are presented in the left column. Actually, there are two cases for evaluating the performance of the updating mechanisms. In the first case, the evaluation could have been performed by creating one more well-separated Gaussian. In the second one, which we have chosen to follow and is much harder, the performance is evaluated on a Gaussian distribution that overlaps with one of the two initial Gaussians. Therefore, we generated 25 new samples from a third Gaussian distribution with mean value 21 and standard deviation 1.0. The middle column indicates the adaptation performance after 25 new observations, while the right column after 50 new observations. Our model, either it uses the history of observed data or not, creates a new component and successfully fits the data. On the contrary, the model of  is not able to capture the statistical relations of the new observations and fails to separate the data generated from the overlapping Gaussians (middle and right columns). The quality of the presented updating mechanism becomes more clear in the right column, which presents the adaptation of the models after 50 new observations.
Vii-C Background Subtraction Algorithm Evaluation
Vii-C1 OSU and AIA datasets
For evaluating our algorithm, we use the Ohio State University (OSU) thermal datasets and a dataset captured at Athens International Airport (AIA) during eVacutate111http://www.evacuate.eu European Union funded project. Specifically, we used two OSU datasets, referred as OSU1 and OSU2, which contain frames that have been captured using a thermal camera and have been converted to grayscale images. On the contrary, the AIA dataset contains raw thermal frames whose pixel values correspond to the real temperature of objects.
OSU datasets [23, 43, 2] are widely used for benchmarking algorithms for pedestrian detection and tracking in thermal imagery. Videos were captured under different illumination and weather conditions. AIA dataset was captured using a Flir A315 camera at different Airside Corridors and the Departure Level. Ten video sequences were captured, with frame size pixels of total duration 32051 frames, at 7.5fps. The experimentation was conducted throughout the third pilot scenario of eVacuate project. For all datasets we created our own ground truth by selecting 50 frames randomly but uniformly distributed, in order to cover the whole videos duration. Then, we manually annotated this frames by creating a binary mask around the foreground objects.
We compared our method against the method of  (MOG), which is one of the most robust and widely used background subtraction techniques. MOG algorithm uses a pre-defined number of Gaussian components for building the background model. In order to perform a fair comparison we fine-tune the parameters of MOG algorithm for each of the two datasets to optimize its performance. Furthermore, we compare our method against the method of [43, 2] (SBG) used for background substraction in thermal data. This method uses a single Gaussian distribution for modeling the background and, thus, it often under-fits the data. Comparison against this technique can highlight problems that arise when the number of components of a GMM is underestimated. We do not compare our method against a DPMM-based background subtraction technique, like the one in , since its computational cost is high and we target low power and memory devices.
For estimating the background model, we utilized 100 frames as history and set the maximum number of components equal to 50. After the initialization of the background model, we observed that the models for OSU and AIA datasets consists of to and to components respectively. Fig. 7 visually present the performance of the three methods. As is observed, our method outperforms both MOG and SBG on all datasets. While MOG and SBG perform satisfactory on grayscale frames of OSU datasets, their performance collapses when they applied on AIA dataset, which contains actual thermal responses, due to their strong assumptions regarding the distribution of the responses of pixels and the peculiarities of thermal imagery i.e. high signal-to-noise ratio, lack of color and texture and non-homogeneous thermal responses of objects (see Section I). Then, an objective evaluation takes place in terms of recall, precision and F1 score. Regarding OSU datasets, MOG algorithm presents high precision, however, it yields very low recall values, i.e. the pixels that have been classified as foreground are indeed belong to the foreground class, but a lot of pixels that in fact belong to background have been misclassified. SBG algorithm seems to suffer by the opposite problem. Regarding AIA dataset, our method significantly outperforms both approaches. Although, MOG and SBG algorithms present relative high precision, their recall values are under . Figure 8 presents average precision, recall and F1 score per dataset and per algorithm.
Regarding computational cost, the main load of our algorithm is in the implementation of EM optimization. In all experiments conducted, the EM optimization converges within 10 iterations. Practically, the time required to apply our method is similar to the time requirements of Zivkovic’s method making it suitable for real-time applications.
Vii-C2 Change Detection Benchmark
Besides OSU and AIA datasets we evaluated the performance of our algorithm on the change detection benchmark - 2014 (CDB). The CDB provides five thermal videos, recorded at indoor and outdoor environments, along with their ground truth. For evaluating the performance of our algorithm, we utilized the same metrics as in CDB-2014, i.e. precision, recall and F1 score, specificity, False Positive Rate (FPR), False Negative Rate (FNR) and Percentage of Wrong Classifications (PWC).
|Cascade CNN ||0.951||0.899||0.920||0.997||0.003||0.049||0.405|
Table II presents the performance of our algorithm on the thermal datasets of CDB and compares it against the top two methods for each one of the metrics. The method of  outperforms all methods. Our method presents the second highest recall, however, due to its lower precision it presents lower F1 score. Although, our method performs slightly worst than the leaders of CDB 2014, it is much less complicated and thus suitable for implementation in-camera.
Vii-D Hardware Cost
The main argument of this work is that a novel highly accurate and demanding algorithm that needs to run in a 24/7 basis could be handled very efficiently by a reconfigurable device running as an in-camera accelerator. Thus, we primarily demonstrate our system in a low cost Xilinx Atrix7 FPGA device (xc7a200tfbg484-3). In addition, we deploy our system in a more powerful Virtex7 device (xc7vx550tffg1158-3) to show that it seamlessly scales to support more parallel cores. For the code synthesis and bitstream generation we used Xilinx Vivado and Vivado HLS. For validation and proof only purposes our system was implemented in a low end Zedboard evaluation platform powered by a small Xilinx Zynq device.
Table III shows the hardware utilization for the Artix7 device when implementing 4 BSU cores and 1 MEU core. We implemented only 1 MEU, since this unit operates only for the initialization and parameter estimation of the system, and thus, its performance is not crucial. Table III also shows that the critical resource is the LUTs and DSPs. This is justified by the fact that the operations involved in the algorithm are mostly multiplications and divisions, and thus, apart from the DSPs, additional logic and signals are necessary to route the intermediate results and handle all algorithm states. DRAM utilization is almost zero as all operations are per pixel and no further caching in data is necessary, since there is no need for keeping the observed data in memory. It should be mentioned that keeping the observed data in memory would prohibit the implementation of this algorithm in low memory devices. Table IV shows the hardware utilization of the Virtex 7 device when implementing 16 BSU cores and 1 MEU core. The resource utilization in this case follows the same reasoning as before. The above two hardware configurations are compared with a quad-core ARM Cortex A9 CPU (Exynos4412 SoC) clocked at 1.7 GHz with 2GB RAM and a low power mobile Intel i5 (2450M) Processor clocked at 2.5Ghz with 8GB RAM, which features two physical cores with hyper threading capability (4 threads in total). It is selected for the evaluation as it offers a competitive computation power per watt.
|Number of Flip Flops||143089||269200||53%|
|Number of Slice LUTs||119964||129000||92%|
|Number of DSP48E||506||740||68%|
|Number of Block RAM_18K||20||730||2%|
|Number of Flip Flops||241604||692800||35%|
|Number of Slice LUTs||269004||346400||78%|
|Number of DSP48E||1184||2880||41%|
|Number of Block RAM_18K||14.50||1180||1.2%|
|Artix 7 4–cores||17.36 fps||4.34 fps||3.45|
|Vertex 7 16-cores||69.88 fps||17.47 fps||3.49|
|ARM A9 4-cores||8.27 fps||2.07 fps||4.7-6.2|
|Intel i5 2-cores/ 4-threads||58.59 fps||14.56 fps||5.82|
|MOG  BF-537 DSP||3.57 fps||-||-|
For the Intel i5 the software compiler platform used was Microsoft Visual Studio 2012 and our code was optimized for maximum speed (-O2 optimization level). For the ARM A9 platform, we used a lightweight XUbuntu 13.10 operating system with a g++ compiler using -O2 and-O3 optimization level. In all software reference implementations OpenMP was also used to utilize all the available cores/threads of the platform. For the FPGA, we measure the exact clock cycles needed for segmenting a single pixel by a single core including loading and write back cycles. For this purpose, we use the Zedboard evaluation board. The exact clock cycles measured between 700-830 when real datasets are evaluated. These measurements are also verified for the proposed Atrix7 and Virtex7 devices using post-place and route timing simulation.
The I/O latency between the DRAM and the FPGA is completely hidden as the operations for each core are depending only on a single pixel and its corresponding background model. All this information is encoded in about 256 bits in average, thus a buffering scheme using simple FIFOs is easily implemented. The bandwidth demands between the device and the DRAM is no more than 250 MB/sec for 25FPS at 640x480 resolution, which is easily achievable even from low-end FPGA devices. In all the experiments for the Intel i5 and ARM A9 we start measuring latency times after the data are loaded in the DRAM of the CPU. This probably is in benefit of the CPUs as the cache is hot in most of the measured cases.
Table V shows that implementing just 4-cores in the Atrix7 device we get 17.3 FPS at 320x240 exceeding by far the capabilities of the FLIR A-315 thermal camera. The 4-core FPGA design outperforms the ARM A9 quad core CPU giving twice the FPS. In terms of power, Atrix7 consumes 4.6 watts based on Vivado’s Power analysis tool while quad core ARM A9 consumes about 3.5-4 watts As expected the Intel i5 utilizing 4-threads outperforms the two previous platforms offering also the best performance per core. Its consumption is measured at 26.2 watts and refers only to the CPU consumption. The Virtex 7 device offers better performance, as it is capable of fitting 16-BSU cores. In terms of power the Virtex7 consumes 18.6 Watts measured using Vivado’s Power analysis tool.
Looking at the energy metric μJ/pixel in Table V, both FPGA devices give similar μJ/pixel and also better than the Intel i5. For the ARM A9 this metric is expressed as a range as it is based mostly on specs. In our evaluation experiments we could measure the total dynamic power of the board using the ODROID smart Power5 but it is not possible to safely measure only the CPU core modules.
The last column in Table V refers to the work of  which implements the original MOG algorithm in an in-camera DSP processor (Blackfin BF-537) as a reference design for his proposed scheme. Even though it is hard to make a direct comparison, we see how challenging for embedded processors is it to keep up with the demanding task of background segmentation; even for a less accurate algorithm such as MOG.
In this work a novel algorithm for background subtraction was presented which is suitable for in-camera acceleration in thermal imagery. The presented scheme through an automated parameter estimation process, takes into account the special characteristics of data, and gives highly accurate results without any fine-tuning from the user. It is implemented in reconfigurable hardware using a HLS design flow with no approximations in accuracy, arithmetic or in the mathematical formulation of the proposed algorithm. Unlike previously introduced custom-fit hardware accelerators, our scheme is platform independent, scalable and easily maintainable. Finally, to the best of our knowledge this is the first time that the very demanding task of background subtraction can be executed to thermal camera sensor in real-time and at low power budget, which allows for a distributed new approach that avoids the bottlenecks of the existing centralized solutions.
Appendix A Derivation of Optimal Variational Distributions
Due to the fact that there is no term in (35b) that contains parameters from both sets and , the distribution can be factorized as . The distribution for is derived using only those terms of (35b) that depend on the variable . Therefore the logarithm of is given by
We have made use of , and we have denote as . (36c) suggests that
is a Dirichlet distribution with hyperparameters.
Using only those terms of (35b) that depend on variables and , the logarithm of is given by
For the estimation of , we use (37) and keep only those factors that depend on .
where , and .
After the estimation of , logarithm of the optimized the distribution is given by
The parameters and are given by