A Maximum Likelihood Approach to Speed Estimation of Foreground Objects in Video Signals

03/10/2020 ∙ by Veronica Mattioli, et al. ∙ 0

Motion and speed estimation play a key role in computer vision and video processing for various application scenarios. Existing algorithms are mainly based on projected and apparent motion models and are currently used in many contexts, such as automotive security and driver assistance, industrial automation and inspection systems, video surveillance, human activity tracking techniques and biomedical solutions, including monitoring of vital signs. In this paper, a general Maximum Likelihood (ML) approach to speed estimation of foreground objects in video streams is proposed. Application examples are presented and the performance of the proposed algorithms is discussed and compared with more conventional solutions.




I Introduction

Motion and speed estimation can be necessary processes in order to recover specific real-world related information arising from the movement of an object in a three-dimensional (3D) scene captured by an image acquisition system, e.g. a camera. When a video is captured, a 3D real-world scene is projected onto the two-dimensional (2D) camera plane, resulting in a sequence of digital images. Hence, motion in a video stream can be considered as the consequence of the projection of objects moving in the 3D scene [pesquet]. This phenomenon may arise when objects are moving and the camera is still, when the camera is moving and the objects are still or when both are moving.

Various motion estimation algorithms are widely documented in the literature, as reported, e.g., in [pesquet], [chen] and [VideoProcessing, Ch. 4]. In particular, the most popular techniques include differential methods and matching methods.

Differential methods aim to estimate the optical flow, defined as the apparent motion perceived from the variations in the pixel intensity patterns in the 2D image, which may be due either to a true motion in the 3D space or to illumination changes [optflow]. These approaches are based on gradient techniques that exploit the properties of spatial and temporal partial derivatives. The most popular solutions in this category are the Lucas-Kanade [kanade] and the Horn-Schunck [schunck] methods, which rely on the brightness constancy model and the smoothness constraint of the velocity vector [optflow], respectively. Because of these specific assumptions, the two methods may fail to describe some realistic scenarios properly, which is their main limitation.

Block matching approaches, the most widely employed matching methods, partition the considered video frame into several blocks of pixels, which may or may not overlap, and associate each block with a motion vector [VideoProcessing, Ch. 4]. Block matching criteria, despite being straightforward, are highly sensitive to noise and to parameter settings, especially the block size. As a consequence, the estimated motion vector may not coincide with the true motion. In this category, algorithms to estimate the mean speed of a group of vehicles from Moving Picture Experts Group (MPEG) video streams are presented in [yu] and [hu], where motion vectors are directly extracted from the considered stream.
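As a rough illustration of the block-matching idea, the following Python sketch performs an exhaustive search that minimizes the sum of absolute differences (SAD) for a single block. The function name, the SAD criterion and the `search` window parameter are assumptions made for illustration and are not taken from [VideoProcessing, Ch. 4].

```python
import numpy as np

def block_motion_vector(prev, curr, top, left, block, search):
    """Return the (dy, dx) shift that best matches one block of `prev` in `curr`.

    Hypothetical helper: exhaustive search with a SAD matching criterion.
    """
    ref = prev[top:top + block, left:left + block]
    height, width = curr.shape
    best, best_cost = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            # Skip candidate blocks that fall outside the frame.
            if r < 0 or c < 0 or r + block > height or c + block > width:
                continue
            cost = np.sum(np.abs(curr[r:r + block, c:c + block] - ref))
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best
```

The result depends directly on `block` and `search`, which reflects the sensitivity to parameter settings mentioned above.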

To avoid problems related to differential and matching methods, we wish to investigate the application of a more fundamental estimation principle, namely the Maximum Likelihood (ML) approach [Kay], to estimate the speed of a foreground object in considered video sequences.

The remainder of the paper is organized as follows. In Section II the motion estimation model of a framed object in a video sequence is described and a brief discussion on background removal techniques is proposed. The ML estimation approach is presented in Section III. The performance of the proposed algorithms is discussed in Section IV on the basis of several experimental results. Finally, conclusions are drawn in Section V.

II Observation Model

II-A Preliminary Definitions

A video can be defined as a time sequence of digital images, referred to as frames, whose spatial-intensity pattern may vary over time. It is hence a time-varying multidimensional signal in which two spatial components identify a pixel position within the frame. Each frame can also be thought of as a two-dimensional image (per colour component) formed by projecting a real-world 3D scene onto a 2D image plane. The motion and variation of an object in the real world correspond to changes in the pixel intensity values in the 2D image plane. Motion estimation is therefore a process that allows the recovery of real-world related information by analysing the time evolution of the framed area.

Before presenting the formulation of the motion model and discussing the ML approach to speed estimation of a framed foreground object, the following simplifying assumptions are introduced:

  • only framed areas including a single moving object are considered;

  • the capturing camera is still;

  • object transformations due to perspectival issues arising from the projection of the 3D real world scene onto the 2D image plane are not accounted for.

The first condition simplifies the model and allows the effectiveness of the proposed algorithm to be tested in the simplest possible scenario. The camera is considered still to avoid superposed motions other than the considered object motion. Finally, the last condition assumes that no perspectival transformation affects the object. These simplifying assumptions allow us to concentrate on the main goal of this paper, which is motion estimation, and will be relaxed in further investigations that are currently under way.

II-B Object Motion Model

A video signal is composed of a set of frames sampled at instants , where is the discrete time index corresponding to the frame number, and is the sampling time, with being the sampling frequency, or frame rate, of the camera. A digital grayscale video signal can be described as a two-dimensional discrete function , where indicates the pixel position within the frame (footnote 1: $(\cdot)^T$ denotes the transpose operator), with and , whose value corresponds to the pixel intensity. In particular, is the frame size, being and its height and width, respectively. As customary, the dynamic range of the pixel intensity is limited to the interval . In the case of coloured videos, a fourth dimension can be added to this function in order to specify the colour channel index.

Considering a framed object shifting with a simple translation in the two-dimensional projected camera plane, its motion can be described by a displacement vector , with horizontal () and vertical () components. Defining as the image of the still object of interest, the pixel intensities of a video stream can be modelled as the summation of a few main elements [Bri2]:


where is the still background, the object is shifted with a displacement , is a term which takes into account the occluded/un-occluded parts of the background and are samples of independent and identically distributed (i.i.d.) zero-mean Gaussian noise. The background is assumed static, i.e., time-invariant. Equation (1) can be considered as the most general model to describe a single framed moving object in a video stream. This model applies directly to a group of objects subject to the same displacement.
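For illustration only, the additive structure of model (1) described above can be sketched in LaTeX under an assumed notation. The symbols $x_n$, $b$, $s$, $\mathbf{d}_n$, $c_n$ and $w_n$ are placeholders chosen here and may differ from the notation of the original equation.

```latex
% Assumed notation (not taken from the source):
%   x_n[p] : observed frame at discrete time n, pixel position p
%   b[p]   : still background
%   s[p]   : image of the still object, shifted by the displacement d_n
%   c_n[p] : occluded/un-occluded background term
%   w_n[p] : i.i.d. zero-mean Gaussian noise samples
x_n[\mathbf{p}] = b[\mathbf{p}] + s[\mathbf{p} - \mathbf{d}_n]
                + c_n[\mathbf{p}] + w_n[\mathbf{p}]
```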

Consider now the special case of constant speed for further simplification. The displacement term can be explicitly written as , where is the uniform speed motion vector in pixel/frame, that we wish to estimate. Hence, the observation model in (1) can be rewritten as:


where the direction and the speed of the object displacement have been considered time-invariant for the sake of simplicity. This model can be easily applied to a slowly time varying scenario, provided can be considered approximately constant over a sufficiently long time window.
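A constant-speed observation of this kind can be simulated with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `synth_video`, the (row, column) ordering of the speed vector and the integer speed values are choices made here for illustration, and pasting the object over the background plays the role of the occlusion term.

```python
import numpy as np

def synth_video(background, obj, start, v, n_frames, noise_std, rng):
    """Generate frames of `obj` translating at constant speed over `background`.

    Hypothetical helper. `start` and `v` are (row, col) integer pairs;
    `v` is expressed in pixel/frame. The object must stay inside the frame.
    """
    height, width = background.shape
    h, w = obj.shape
    frames = np.empty((n_frames, height, width))
    for n in range(n_frames):
        frame = background.copy()
        r = start[0] + n * v[0]  # displacement grows linearly with n
        c = start[1] + n * v[1]
        frame[r:r + h, c:c + w] = obj  # the object occludes the background
        # Additive i.i.d. zero-mean Gaussian noise on every pixel.
        frames[n] = frame + noise_std * rng.standard_normal((height, width))
    return frames

rng = np.random.default_rng(0)
video = synth_video(np.full((20, 30), 0.2), np.ones((3, 4)),
                    start=(2, 1), v=(1, 2), n_frames=5,
                    noise_std=0.01, rng=rng)
```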

Models (1) and (2) are formulated in the time domain. Using a frequency-domain representation may be more convenient, as discussed in . The main advantage of operating in the frequency domain lies in the properties of the Fourier Transform (FT). In particular, thanks to the shift theorem, a displacement in the time domain can be described as a linear phase term in the frequency domain. Hence, the dependence of on the parameter can be factored out when working in the frequency domain. Defining the Discrete Fourier Transform (DFT) of a generic two-dimensional discrete function of size as in [Sol], the observation model in (2) can be expressed as:


where is the vector collecting the two discrete indices of the two-dimensional DFT, with , and is the vector of the normalized spatial frequencies.
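The shift theorem invoked above can be checked numerically. The short Python sketch below, with array sizes and variable names chosen here for illustration, verifies that a circular shift of an image by an integer displacement multiplies its 2D DFT by a linear phase term in the normalized spatial frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 16, 24                      # frame height and width (arbitrary)
s = rng.random((M, N))             # stand-in for the still object image
d = (3, 5)                         # integer displacement (rows, cols)

# Circularly shifted image: shifted[p] = s[p - d]
shifted = np.roll(s, shift=d, axis=(0, 1))

S = np.fft.fft2(s)
f1 = np.fft.fftfreq(M)[:, None]    # normalized vertical frequencies
f2 = np.fft.fftfreq(N)[None, :]    # normalized horizontal frequencies

# Linear phase term predicted by the shift theorem.
phase = np.exp(-2j * np.pi * (f1 * d[0] + f2 * d[1]))

match = np.allclose(np.fft.fft2(shifted), S * phase)
print(match)
```

The identity is exact for circular shifts; for an object moving over a larger frame without wrapping around, it holds as long as the object stays inside the frame.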

Equations (2) and (3) describe Gaussian observations that are independent both in discrete time and frequency domains. Focusing on the model in (3), the speed vector  represents the only unknown parameter to be estimated.

The ML criterion can be applied to the model in (3) to obtain an expression for the speed vector estimator . Before detailing the estimation procedures, we briefly discuss the role of background removal techniques, which allow the foreground object to be detected and may be useful to further simplify the models in (2) and (3).

II-C Background Removal

Background removal techniques aim to separate the background from the foreground moving objects. The basic requirement is that the background is static. Nevertheless, this condition is often violated, especially in outdoor scenes that are subject to illumination changes and other variations, e.g., due to wind effects or shadows. For this reason, adaptive algorithms for background removal, such as the Gaussian Mixture Model (GMM) [stauffer], are usually preferred. The concept of background removal relies on the analysis and the comparison of two frames in the considered video stream: the background frame, also referred to as the reference frame, where no moving objects are present, and the frame at a chosen instant where moving objects are present. The difference between these two frames represents the foreground. A comprehensive review of background removal approaches is proposed in [piccardi].
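The two-frame comparison described above can be sketched in Python as a thresholded difference against a reference frame. This is deliberately the simplest non-adaptive variant and not the GMM of [stauffer]; the function name and the `threshold` parameter are assumptions made for illustration.

```python
import numpy as np

def foreground_by_difference(frame, reference, threshold):
    """Separate the foreground by comparing `frame` with a background frame.

    Hypothetical helper: a pixel is declared foreground when its absolute
    difference from the reference (background) frame exceeds `threshold`.
    Returns the foreground-only frame and the boolean foreground mask.
    """
    mask = np.abs(frame - reference) > threshold
    foreground = np.where(mask, frame, 0.0)  # background pixels set to zero
    return foreground, mask
```

A fixed threshold of this kind is exactly what breaks down under illumination changes, which is why adaptive models are usually preferred.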

In this paper, we use the reliable GMM that adaptively models each pixel belonging to the background as a mixture of Gaussian distributions. The interested reader is referred to [stauffer] for an accurate description of the proposed method.

Using the GMM technique, it is possible to remove the background-related terms and to concentrate on the foreground and simplify the observation model in (2) as:


Likewise, the frequency-domain observation model in (3) simplifies to:


III Maximum Likelihood Speed Estimation

III-A Speed Estimation with Included Background

The observation model (2) and its frequency-domain counterpart (3) include the terms and , respectively, to account for the background occlusion by the moving foreground object. This masking operation requires the extraction of the foreground and is equivalent to the background removal discussed in the next subsection. As a heuristic approach, the term in (3) can be simply neglected [Bri1]. The resulting observation model is obtained by setting in (3).

Following standard methods described in [Kay], the ML criterion can be applied to obtain an expression for the log-likelihood function to be maximized. A likelihood function of the observed data in (3), with , can be defined on the basis of a window of observed frames as:



where is the standard deviation of the additive Gaussian noise elements. A log-likelihood function can be derived from (6) as:


The operating principle of the ML approach is to find the value of that maximizes the log-likelihood function in (7). Equivalently, the term (a) can be minimized, since the term (b) and the multiplicative coefficient are constant and independent of . The term (a) in (7) can be written explicitly as:
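A worked sketch of this step in LaTeX, under an assumed notation: $X_n[\mathbf{k}]$, $B[\mathbf{k}]$ and $S[\mathbf{k}]$ denote the DFTs of the observed frame, of the background and of the still object, $\mathbf{f}_{\mathbf{k}}$ the vector of normalized spatial frequencies, $N_f$ the number of frames in the window, $\sigma$ the noise standard deviation and $C$ a constant. These symbols and the exact multiplicative coefficient are assumptions made here, and the occlusion term is neglected as in the heuristic approach described above.

```latex
% Log-likelihood with terms (a) and (b) marked (assumed notation)
\ln \Lambda(\mathbf{v}) = -\frac{1}{2\sigma^2}
  \underbrace{\sum_{n=0}^{N_f-1} \sum_{\mathbf{k}}
  \left| X_n[\mathbf{k}] - B[\mathbf{k}]
  - S[\mathbf{k}]\, e^{-j 2\pi n \mathbf{f}_{\mathbf{k}}^T \mathbf{v}} \right|^2}_{(a)}
  + \underbrace{C}_{(b)}

% Expansion of one summand of (a), with \varphi = 2\pi n \mathbf{f}_{\mathbf{k}}^T \mathbf{v}
\left| X - B - S e^{-j\varphi} \right|^2
  = |X|^2 + |B|^2 + |S|^2
  - 2\,\Re\{ X B^* \}
  - 2\,\Re\{ X S^* e^{j\varphi} \}
  + 2\,\Re\{ B S^* e^{j\varphi} \}

% Dropping the terms independent of \mathbf{v}, minimizing (a) is equivalent to maximizing
\sum_{\mathbf{k}} \Re\left\{ S^*[\mathbf{k}] \sum_{n=0}^{N_f-1}
  \left( X_n[\mathbf{k}] - B[\mathbf{k}] \right)
  e^{j 2\pi n \mathbf{f}_{\mathbf{k}}^T \mathbf{v}} \right\}
```

The inner sum over $n$ has the form of a Fourier transform of the temporal sequence evaluated at a frequency proportional to $\mathbf{f}_{\mathbf{k}}^T \mathbf{v}$, which is consistent with the continuous-frequency FT introduced in the text that follows.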


where indicates the real part. The terms , and can be neglected because they are independent of . Hence, minimizing (8) corresponds to maximizing the following quantity, where three terms are highlighted:


Using the linearity of and sum operators, can be factored out from (T1) and (T3) to obtain:


The continuous-frequency FT of the temporal sequence can be now defined as:


Setting , equation (10) can be expressed as:


where is the temporal mean of the sequence and is defined as:


Finally, an expression for the estimator of the speed vector with included background is:


III-B Speed Estimation with Omitted Background

Following the procedure described above, it is possible to derive an estimator of the speed vector in the case of omitted background according to model (5), by setting .

In this case, (11) becomes


and (14) simplifies as:
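The following Python sketch illustrates how an ML-type speed estimate with omitted background could be computed by grid search. It is not the estimator in (16): it assumes that the DFT of the still object (the `template` argument) is known, that speeds are integers in pixel/frame, and that shifts are circular. The function name and all parameters are hypothetical.

```python
import numpy as np

def ml_speed_integer(frames, template, v_max):
    """Grid-search ML-type speed estimate from background-free frames.

    Hypothetical helper. Maximizes, over integer speeds (v1, v2), the quantity
        sum_n sum_k Re{ Y_n[k] * conj(S[k]) * exp(+j 2 pi n f_k^T v) },
    i.e. the correlation between each frame and the shifted template.
    """
    n_frames, height, width = frames.shape
    Y = np.fft.fft2(frames, axes=(1, 2))       # per-frame 2D DFT
    S = np.fft.fft2(template)                  # DFT of the still object
    f1 = np.fft.fftfreq(height)[:, None]       # normalized frequencies
    f2 = np.fft.fftfreq(width)[None, :]
    n = np.arange(n_frames)[:, None, None]     # discrete time index
    cross = Y * np.conj(S)
    best, best_score = None, -np.inf
    for v1 in range(-v_max, v_max + 1):
        for v2 in range(-v_max, v_max + 1):
            # Phase term compensating a displacement of n * v pixels.
            phase = np.exp(2j * np.pi * n * (f1 * v1 + f2 * v2))
            score = np.real(np.sum(cross * phase))
            if score > best_score:
                best, best_score = (v1, v2), score
    return best
```

Restricting the search to an integer grid mirrors the quantization of the estimates to integer values mentioned in the numerical results; a finer grid or a continuous optimizer would be needed for non-integer speeds.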


IV Numerical Results

In this section, the performance of the presented ML speed estimation algorithms is discussed. In particular, the results are presented in terms of Root Mean Square Error (RMSE) between and the correct speed of the framed object for two different categories of videos. A set of synthetic and software-generated videos is considered to preliminarily test the effectiveness of the derived methods in a controlled environment, then a number of real-world videos, specifically recorded for this purpose, is analysed. Comparisons with the well-known block-matching method [VideoProcessing, Ch.4] are also presented.

IV-A Software-Generated Videos

As a first experiment, we apply the proposed speed estimation methods to synthetic videos obtained by inserting real pictures of moving objects in an artificial environment (i.e., a static background) and by adding white Gaussian noise to simulate the behaviour of a real image acquisition system. A sample frame of a grayscale video considered for this purpose is shown in Figure 1(a), where the moving object is the (highlighted) bird, placed in the upper part (footnote 2: the bird has been highlighted to ease visualization in Figure 1(a) only, not in the processed video). The main advantage of synthetic videos lies in the possibility of setting a number of parameters manually or automatically, including the speed components, thus allowing a direct assessment of the estimation performance.

In particular, we consider a set of videos in order to test as many values of speed components, measured in pixel/frame. The duration of each video is about s with a frame rate Hz and a frame size of pixels. The total number of frames is , of which the first are background frames, exploited by the GMM algorithm, and in (11) and (15). The initial variance of the Gaussian distributions in the mixture ( in this case) is set to [stauffer], as a good compromise to account for the variations due to the additive white Gaussian noise (AWGN) and the motion of the foreground object.

Using (14) and (16) the speed components are estimated in the cases of included and omitted background. Results are shown in Figure 1(b), along with a comparison with the well-known block-matching approach, for which a block size of pixels is chosen by trial and error to take into account the foreground size. For each video, the RMSE normalized to the correct values of the speed components for different noise realizations is measured. The obtained normalized RMSE values are then averaged with respect to the various videos and the result is plotted against increasing values of the noise variance for all the assessed algorithms.
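The error metric described above can be sketched in Python as follows. The function name and the exact normalization (division by the magnitude of the correct speed component) are assumptions made here for illustration.

```python
import numpy as np

def normalized_rmse(estimates, true_value):
    """RMSE of speed estimates over noise realizations, normalized to the
    correct speed component (hypothetical helper; `true_value` must be nonzero).
    """
    estimates = np.asarray(estimates, dtype=float)
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    return rmse / abs(true_value)

# Average over several videos, each with its own estimates and true speed.
per_video = [normalized_rmse([2, 2, 2], 2), normalized_rmse([1, 3], 2)]
average_nrmse = float(np.mean(per_video))
```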

Figure 1(b) shows the robustness of the proposed methods against noise variations for both the considered cases. The average RMSE is zero for low values of noise variance because the correct and estimated speed components are quantized to integer values. In particular, the case of omitted background gives lower errors up to a noise variance of about , while the case of included background is more efficient for higher values of noise variance. On the other hand, the high noise sensitivity of the block-matching criterion is confirmed by observing the rapidly increasing trend of its RMSE curve.

IV-B Real-World Videos

As realistic examples, real videos of moving cars are considered. In particular, we apply the proposed algorithms to a set of videos with durations that range between about and s and with various numbers of background frames. The videos are recorded with a frame rate Hz and have a frame size of pixels. The number of frames in (11) and (15) ranges between and . Furthermore, the initial variance of the Gaussian distributions in the mixture model is set to to track scene modifications due to lighting changes and wind effects, in addition to the AWGN and the motion of the foreground object, as in the previous scenario. The car reference speeds are manually measured and can be considered approximately constant over the entire video duration. A sample frame of a video in this category is shown in Figure 1(c).

Following the previous approach, our speed estimation methods are applied with included and omitted background, and compared with the performance of the block-matching algorithm, for which the block size is again set to pixels.

As in the previous example, for each video the RMSE normalized to the correct speed components is measured for different noise realizations and the obtained values are again averaged with respect to the various videos. In Figure 1(d) the results are plotted against increasing values of the noise variance for the three algorithms. For very low noise, e.g. for , the block-matching approach yields a low error; nevertheless, its high noise sensitivity is again confirmed by the rapidly increasing trend of its RMSE curve. This demonstrates the robustness of the proposed ML estimation techniques, which exhibit lower values of RMSE even at much higher values of noise variance.

In this example, the average RMSE with included background settles at about , except for small values of noise variance, because speed estimation consistently fails in some of the considered videos. However, the performance of the algorithm with omitted background is significantly better. We also note that the RMSE values for low noise variance are not exactly zero, unlike the previous case. This is because the speed component estimates are quantized to integer values, whereas the correct speed components are not integers. Furthermore, the nature of real videos, including artifacts such as illumination changes and wind effects, may affect the operation of background removal and foreground detection performed by means of the GMM algorithm.

Fig. 1: Examples of applications and results: (a) sample frame of a synthetic video, (b) average RMSE for synthetic videos, (c) sample frame of a real video, (d) average RMSE for real videos.

V Conclusion

In this paper, we presented a novel approach for speed estimation in video signals. We derived an observation model for included and omitted background and we applied the ML criterion to estimate the speed components of a foreground object shifting with a nearly constant speed. By considering synthetic and real videos as examples of application, we demonstrated the effectiveness of the proposed algorithms by comparison with the well-known block-matching algorithm.