Motion estimation in consecutive video-frames is one of the important techniques in the image processing and computer vision communities. Motion estimation is defined as estimating the motion velocity fields (vectors) of objects appearing in two successive (video) frames. In the research field of computer vision, the so-called Markov random fields (MRFs for short) have been used to solve various problems in image processing such as image restoration, texture analysis and segmentation [2, 3, 4, 5, 6]. The MRFs enable us to regularize the ill-posed problems arising in many of these subjects, so that the original problem can be treated as a combinatorial optimization problem under some ‘soft’ or ‘hard’ constraints. Indeed, Zhang and Hanouer (1995) and Wei and Li (1999) applied the MRF approach within the framework of Bayesian statistics to estimate the motion vectors for two given consecutive digital images. They also utilized the so-called mean-field approximation to carry out the extensive sums in the marginal probability of the posterior and showed that the steady states of the mean-field equations are good candidates for the appropriate motion velocity fields. The same kind of MRF approach was implemented on a DSP-based image processing board of a SIMD (Single Instruction Multiple Data) machine by Caplier, Luthon and Dumontier (1998) and Luthon, Caplier and Lievin (1999). They demonstrated that the task of estimating the motion velocity can actually be carried out within a realistic time.
In the study by Zhang and Hanouer (1995), the so-called hyper-parameters, which specify the probabilistic model macroscopically, were set to ad-hoc values, and no theoretical (statistical) justification was given for these choices. Of course, the appropriate hyper-parameters depend on the given set of consecutive video-frames, and it is important to determine them systematically under some statistical criterion so as to achieve a good (if possible, optimal) average-case performance of the motion estimation.
Taking into account the above requirements from both the theoretical and practical sides, we examine, from the viewpoint of Bayesian statistics, a mean-field approach assisted by the Markov chain Monte Carlo method (MCMC for short) to estimate both the motion velocity fields and the hyper-parameters simultaneously in successive video-frames described by spatio-temporal MRFs. We find that the mean-field variables in the non-linear maps diverge due to an improper normalization factor of the regularization terms appearing in the cost function. To overcome this difficulty, we rescale the regularization terms by introducing a scaling factor and optimize it by minimizing the mean-square error. We show that the optimal scaling factor stabilizes the mean-field iterative procedure for estimating the motion velocity fields. We next attempt to estimate the optimal values of the hyper-parameters, including the regularization term, which define our probabilistic model macroscopically, by using a Boltzmann-machine type learning algorithm based on gradient descent of the marginal likelihood with respect to the hyper-parameters. In our framework, one can estimate both the probabilistic model (hyper-parameters) and the motion fields simultaneously. We show that our motion estimation performs much better than the result of Zhang and Hanouer (1995), in which the hyper-parameters are set to ad-hoc values without any theoretical explanation.
This paper is organized as follows. In section II, we explain our general set-up for motion velocity estimation by means of spatio-temporal MRFs, following Zhang and Hanouer (1995). From the viewpoint of Bayesian inference, we construct the posterior probability and introduce two kinds of estimation, namely, Maximum A Posteriori (MAP for short) and Maximizer of Posterior Marginal (MPM for short) estimation. In section III, we utilize the mean-field approximation to obtain the MPM estimate and derive the non-linear mean-field equations with respect to the motion velocity fields. In section IV, as a preliminary, we demonstrate our mean-field approach with the hyper-parameters set to the values chosen by Zhang and Hanouer (1995) and show that the mean-fields diverge, leading to a much worse estimate of the motion velocity. To avoid this difficulty, we rescale the regularization term by introducing a scaling factor and optimize it by minimizing the mean-square error. In section V, we attempt to estimate the optimal values of the hyper-parameters, including the regularization term, which define our probabilistic model macroscopically, by using a Boltzmann-machine type learning algorithm based on gradient descent of the marginal likelihood with respect to the hyper-parameters. In our framework, one can estimate both the probabilistic model (hyper-parameters) and the motion velocity fields simultaneously. To solve the learning equations, we carry out the sums, whose number grows exponentially, in two different ways, namely, a hybridization of the mean-field approximation with MCMC, and simple MCMC. We find that the average-case performance of our motion estimation is much better than the result of Zhang and Hanouer (1995), in which the hyper-parameters are set to ad-hoc values. The last section is a summary.
II General set-up of motion estimation
In this section, we briefly explain our model system.
II-A Spatio-temporal Markov random fields
Let us define a single two-dimensional gray-scale image as a ‘video-frame’ by . denotes a set of pixels in image and index is related to a point in two-dimensional square lattice . Here we shall assume that a motion picture consists of successive static images (frames), namely, we distinguish each static image in the motion picture by time index as . When we compare two consecutive static images, that is, and , each pixel in might change its location with some ‘motion velocity’. With this assumption in mind, we introduce velocity fields defined by . Namely, for each and for two successive video-frames, a constraint should be satisfied, where ‘index’ is related to a single point in the two-dimensional vector field. In this paper, we consider that each component of the vector takes a discrete value and the range is limited as . It might seem that this range is extremely small in comparison with the range of the grayscales in images (from to ) or image size (). However, if one attempts to construct a detection and alarm system that identifies a dangerous state from an ‘infinitesimal difference’ in a patient’s breathing in an ICU (Intensive Care Unit), the limitation of the velocity fields to such a small range is rather desirable (reasonable).
II-A1 Line fields and segmentation fields
Obviously, it is impossible to determine the velocity fields uniquely from only the information contained in the two video-frames. To compensate for this lack of information, we introduce line fields and segmentation fields.
The line fields guarantee the continuity between the motion velocity fields at nearest neighboring pixels, on the assumption that two such velocity fields tend to take similar values. Let us define these line fields by . Here and are labels representing the continuity between velocity fields at nearest neighboring (n.n. for short) pixels in the horizontal and vertical directions. In other words, we shall define
On the other hand, the segmentation fields are introduced to distinguish ‘predictable areas’ from ‘unpredictable areas’ in the motion velocity fields. Here ‘unpredictable areas’ means regions hidden by some objects before those objects move somewhere else. Thus, we naturally define the segmentation fields by with
II-B Bayes rule and posterior probability
In the previous subsections, we described the motion picture as a series of successive static images modeled by spatio-temporal Markov random fields. To determine the motion velocity fields uniquely, we also introduced the line and segmentation fields.
Our problem is now to infer the velocity vector field , line field and segmentation field under the condition that two consecutive video-images and are observed. For this problem, we can readily use the Bayes rule to obtain the posterior probability, which is the probability of provided that and are given, as
where we defined the sums appearing in the above formula by with
For the above posterior, we have the so-called Maximum A Posteriori (MAP) estimate by
whereas, what we call Maximizer of Posterior Marginal (MPM) estimate is given by
where we defined the marginal probability by
The average appearing in (6) is defined as and denotes a function that converts the real-valued expectation into the nearest discrete value.
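The difference between the two estimates can be seen on a toy example. The following is a minimal sketch, assuming a made-up joint posterior over two sites with values in {-1, 0, 1}; the probabilities and the names `posterior`, `map_estimate` and `mpm_estimate` are hypothetical and not taken from the paper. The MAP estimate maximizes the joint posterior, whereas the MPM estimate, as described above, rounds the posterior mean of each site to the nearest discrete value.

```python
from itertools import product

# Toy joint posterior over two sites, each taking values in {-1, 0, 1}.
# The probabilities are invented for illustration only.
VALUES = (-1, 0, 1)
posterior = {
    (1, 1): 0.30,
    (0, -1): 0.25,
    (0, 0): 0.25,
    (-1, 1): 0.20,
}

def prob(config):
    """Posterior probability of a joint configuration (zero if not listed)."""
    return posterior.get(config, 0.0)

def map_estimate():
    """Joint configuration with the largest posterior probability."""
    return max(product(VALUES, repeat=2), key=prob)

def mpm_estimate():
    """Site-wise posterior mean, rounded to the nearest allowed value."""
    estimate = []
    for site in range(2):
        mean = sum(c[site] * prob(c) for c in product(VALUES, repeat=2))
        estimate.append(min(VALUES, key=lambda v: abs(v - mean)))
    return tuple(estimate)

print("MAP:", map_estimate())
print("MPM:", mpm_estimate())
```

In this toy posterior the most probable joint configuration and the rounded site-wise means do not coincide, which is why the two estimators are treated separately.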
II-B1 Likelihood function
The likelihood function appearing in the posterior can be regarded as a probabilistic model that generates the next frame, given the unknown fields and the frame at the previous time. From now on, we omit the time-dependence of the fields because we consider the motion velocity fields for a given set of only two consecutive video-frames. Then, we assume where the cost function is given by
where means a set of nearest neighboring pixels around pixel . The number of these pixels is (square lattice). The parameters and are the so-called hyper-parameters which determine the probabilistic model macroscopically.
II-B2 Prior probability
where we defined the norm by
and and are also hyper-parameters which define the above probabilistic model macroscopically.
Then, the posterior , namely, the probability of the desired fields for a given set of two successive video-frames is constructed by the product of likelihood and prior , that is .
By means of the cost function, we have
The total cost of the system, which is now defined by , is written as
where the first term appearing on the right-hand side of the above cost function is introduced to prevent pixel at the location from moving to the position where is quite far from . The second term enforces the continuity between velocity vectors at nearest neighboring pixels, and one readily finds that this term is identical to the Hamiltonian (energy function) of the so-called dynamically diluted ferromagnetic Q-Ising model in the literature of statistical physics, that is to say, we have
in the limit of . The third term in (11) denotes a correlation between the line and the segmentation fields. The fourth term represents a correlation between the line fields and the distance between pixels located at nearest neighboring positions. The last term controls the number of non-zero segmentation fields, and this term can be regarded as the so-called chemical potential in the literature of statistical physics.
III Mean-field equations on pixel
In the previous section, we constructed the posterior by making use of the Bayes rule. Therefore, we can use both MAP and MPM estimations by means of (5) and (6), respectively. Here we should notice that the MAP estimate is recovered by means of
with . From the above definitions, the MPM estimate is obtained by . Therefore, our problem now seems to be completely solved. However, the number of sums appearing in the expectation
grows exponentially as . Obviously, it is impossible for us to carry out these sums within a realistic time even when the system size is .
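To give a feeling for this growth, the short sketch below counts the configurations of the velocity fields alone. The image size (8 by 8 pixels) and the number of values per velocity component (3) are hypothetical choices made only for illustration, and the helper name `num_velocity_configs` is likewise an assumption; the line and segmentation fields would multiply the count further.

```python
def num_velocity_configs(num_pixels, values_per_component, components=2):
    """Number of joint configurations of a discrete vector field.

    Each pixel carries `components` velocity components, each taking
    `values_per_component` discrete values, so the count is K**(components * N).
    """
    return values_per_component ** (components * num_pixels)

# Hypothetical small example: an 8 x 8 image, 3 values per component.
count = num_velocity_configs(num_pixels=8 * 8, values_per_component=3)
print("number of decimal digits in the count:", len(str(count)))
```

Even for such a small hypothetical image, exhaustive enumeration is out of reach, which motivates the approximation introduced next.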
We therefore use the mean-field approximation to overcome this type of computational difficulty. Namely, we rewrite the cost function by replacing the motion velocity fields with the corresponding expectations except for a single component of the fields. For instance, for say , we have the mean-field approximated cost function as follows.
By using the same way as , we have for as
and obtain for as
where stands for a delta-function. By means of the above approximated cost functions, one obtains the following self-consistent equations for .
Regarding the above self-consistent equations with respect to single-site averages as the following ‘non-linear maps’:
we look for the steady states of the above maps which should satisfy the following convergence condition.
where should be a small value, say . In general, the control parameter is a time-dependent variable, , and the MAP estimate is obtained by controlling it as as . On the other hand, the MPM estimate is constructed by setting to during the above iterations.
Generally speaking, the steady state of the maps may differ from the true solution of the self-consistent equations. However, one may expect the two to be close to each other if the landscape of the cost is not as complicated as that of spin glasses.
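The iterative procedure described above can be sketched on a much simpler model. The code below is a minimal illustration, not the mean-field equations of this paper: it assumes a one-dimensional chain of sites with values in {-1, 0, 1}, a quadratic data term with weight `a` and a quadratic smoothness term with weight `b`, and it omits the line and segmentation fields. The names `mean_field_step`, `run_mean_field` and `to_discrete`, and all parameter values, are hypothetical. The parameter `beta` is assumed to play the role of the control parameter (an inverse temperature), and the stopping rule is the convergence condition on the change of the site averages.

```python
import math

def mean_field_step(m, data, a, b, beta, values):
    """One synchronous update of the site averages m_i on a 1-D chain."""
    n = len(m)
    new_m = []
    for i in range(n):
        neighbours = [m[j] for j in (i - 1, i + 1) if 0 <= j < n]
        # Local cost of value v at site i, neighbours frozen at their averages.
        costs = [a * (v - data[i]) ** 2 + b * sum((v - mj) ** 2 for mj in neighbours)
                 for v in values]
        c_min = min(costs)  # subtract the minimum for numerical stability
        weights = [math.exp(-beta * (c - c_min)) for c in costs]
        z = sum(weights)
        new_m.append(sum(v * w for v, w in zip(values, weights)) / z)
    return new_m

def run_mean_field(data, a=1.0, b=0.5, beta=1.0, eps=1e-6, max_iter=1000,
                   values=(-1, 0, 1)):
    """Iterate the non-linear map until the total change falls below eps."""
    m = [0.0] * len(data)
    iterations = 0
    for _ in range(max_iter):
        new_m = mean_field_step(m, data, a, b, beta, values)
        change = sum(abs(x - y) for x, y in zip(new_m, m))
        m = new_m
        iterations += 1
        if change < eps:
            break
    return m, iterations

def to_discrete(m, values=(-1, 0, 1)):
    """Convert each real-valued average to the nearest allowed value."""
    return [min(values, key=lambda v: abs(v - x)) for x in m]

observed = [1, 1, 1, 0, -1, -1]
averages, n_iter = run_mean_field(observed, beta=1.0)
print(n_iter, to_discrete(averages))
```

In this sketch a fixed, large `beta` sharpens the local distributions and stands in for the annealing schedule, while a fixed moderate `beta` keeps the averages soft; a real implementation would follow whatever schedule the model prescribes.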
IV Preliminary: divergence of mean-fields
To check the usefulness of the above procedure, we examine our mean-field algorithm to infer the motion velocity fields for a given set of two successive frames shown in Fig. 1.
It should be noted that these two frames are artificially generated, so that the true motion velocity vector fields are known explicitly and can be used to check the usefulness of our mean-field algorithm.
Generally speaking, in Bayesian inference, setting the hyper-parameters of the probabilistic model is one of the most important tasks. Here we examine the values that were chosen ad hoc by Zhang and Hanouer (1995).
We find that the above choice of the hyper-parameters causes a divergence of the mean-fields such as due to the regularization terms or which appear in the mean-field equations. We show the resulting velocity fields obtained with this choice of hyper-parameters in Fig. 2. The velocity fields shrink to a few points with small lengths, and one clearly fails to estimate the true velocity fields.
IV-A Optimization of scaling factor
The origin of the above difficulty apparently comes from the divergence of these regularization terms evaluated for two extremely different values of pixels, for instance, say and which leads to . This fact tells us that there exist several serious cases (combinations of two consecutive video-frames) for which the ad-hoc hyper-parameter selection causes this type of divergence during the iteration of mean-field equations.
To avoid this essential difficulty, we rescale the hyper-parameter as and optimize the scaling factor from the viewpoint of several different performance measures.
IV-A1 Performance measures
We first introduce two different kinds of mean-square errors as average-case performance measures to determine the optimal scaling factor .
where and we should keep in mind that holds. is a true velocity field for a given set of two successive images shown in Fig. 1. Thus, the denotes the mean-square error defined by the difference between the true and the estimated velocity fields for zero segmentation regions. On the other hand, is the mean-square error evaluated for non-zero segmentation regions.
We also introduce the bit-error rate which is defined as the number of estimated pixels which are different from the true ones. Namely, we use
where means a Kronecker’s delta which is defined by
where is a ‘conventional’ Kronecker’s delta.
In Fig. 3, we plot the behaviour of the two kinds of mean-square error (upper left) and the bit-error rate (upper right) as functions of the scaling factor . The lower panel shows the resulting velocity fields obtained by setting the optimal scaling factor . From these panels, we find that the resulting velocity fields are very close to the true fields when the scaling factor is set appropriately. However, the ad-hoc choice of the other hyper-parameters is not guaranteed to give the best possible velocity field estimation for another given set of successive images. To make matters worse, in practice we can use neither the mean-square error nor the bit-error rate, because these quantities require information about the true fields (for instance, see the definition of ). Therefore, we should seek some theoretical justification for determining the optimal hyper-parameters.
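The performance measures of this subsection can be illustrated as follows. This is only a sketch under stated assumptions: velocity fields are represented as lists of (vx, vy) tuples, the segmentation labels are 0 or 1, the mean-square errors are normalized by the number of pixels in the corresponding region, and the bit-error rate is reported as a fraction of pixels. The function names and the normalizations are hypothetical and may differ from the definitions used in the paper.

```python
def mean_square_error(true_v, est_v, seg, region):
    """Average squared distance between true and estimated vectors,
    restricted to pixels whose segmentation label equals `region`."""
    pairs = [(t, e) for t, e, s in zip(true_v, est_v, seg) if s == region]
    if not pairs:
        return 0.0
    total = sum((t[0] - e[0]) ** 2 + (t[1] - e[1]) ** 2 for t, e in pairs)
    return total / len(pairs)

def bit_error_rate(true_v, est_v):
    """Fraction of pixels whose estimated vector differs from the true one."""
    wrong = sum(1 for t, e in zip(true_v, est_v) if t != e)
    return wrong / len(true_v)

# Invented four-pixel example.
true_field = [(1, 0), (1, 0), (0, 0), (0, 1)]
estimated = [(1, 0), (0, 0), (0, 0), (0, 1)]
segmentation = [0, 0, 1, 1]

print(mean_square_error(true_field, estimated, segmentation, region=0))
print(mean_square_error(true_field, estimated, segmentation, region=1))
print(bit_error_rate(true_field, estimated))
```

Both measures need `true_field`, which is exactly why they cannot be used to tune hyper-parameters on real video-frames.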
V Maximum marginal likelihood criteria
In statistics, in order to determine the hyper-parameters of the probabilistic model which contains latent variables , the so-called maximum marginal likelihood estimation is widely used. The marginal likelihood (the type-II likelihood) is defined by
namely, the marginal likelihood is obtained by taking the sums of these latent variables in the (log) likelihood function. It should be noted that the above marginal likelihood is dependent on the ‘input’ two successive frames . We can easily show that the marginal likelihood is maximized at the true values of the hyper-parameters , namely,
where we defined the observable data-average by .
V-A Kullback-Leibler information
Taking into account the fact that the Kullback-Leibler (KL) information cannot be negative, we can easily show the above inequality for the marginal likelihood.
Let us consider the KL information between the true probabilistic model and the model . Then, from the definition of the KL information, we immediately have
The equality holds if and only if . Therefore, the above inequality for the marginal likelihood holds, which means that the marginal likelihood takes its maximum at the true values of the hyper-parameters. We use this fact to determine the hyper-parameters. In other words, the negative of the marginal likelihood can be regarded as a ‘cost function’ whose lowest energy state is a candidate for the true hyper-parameters.
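The argument can be checked numerically on a toy family of distributions. The sketch below assumes a one-parameter Boltzmann-type model over three outcomes with made-up energies; it has no latent variables, so the distribution of the observable is written down directly rather than obtained by marginalization. The names `model`, `averaged_log_likelihood` and `kl_information`, and the value of the true parameter, are hypothetical.

```python
import math

ENERGIES = (0.0, 1.0, 2.0)  # invented energies of the three outcomes

def model(theta):
    """Probabilities p_theta(x) proportional to exp(-theta * E(x))."""
    weights = [math.exp(-theta * e) for e in ENERGIES]
    z = sum(weights)
    return [w / z for w in weights]

def averaged_log_likelihood(theta, theta_true):
    """Log-likelihood of theta averaged over data drawn at theta_true."""
    p_true = model(theta_true)
    p = model(theta)
    return sum(pt * math.log(q) for pt, q in zip(p_true, p))

def kl_information(theta_true, theta):
    """KL information between the true model and the model at theta."""
    return (averaged_log_likelihood(theta_true, theta_true)
            - averaged_log_likelihood(theta, theta_true))

theta_true = 0.7
grid = [i * 0.1 for i in range(31)]
best = max(grid, key=lambda th: averaged_log_likelihood(th, theta_true))
print("maximizer on the grid:", best)
print("smallest KL on the grid:", min(kl_information(theta_true, th) for th in grid))
```

Because the KL information is the gap between the two data-averaged log-likelihoods, its non-negativity is the same statement as the maximum being attained at the true parameter.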
VI Hyper-parameter estimation
As we saw in the previous section, we should determine the hyper-parameters so as to maximize the marginal likelihood. In this section, we attempt to construct the Boltzmann-machine type learning equations, which are derived by taking the gradient of the marginal likelihood with respect to the hyper-parameters.
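The structure of such a learning rule can be sketched on a toy model small enough for exact enumeration. The code below assumes a joint distribution proportional to exp(-theta * A(x, h)) over one observable x and one latent variable h, with a single hyper-parameter theta and a made-up statistic A; none of this is the model of the paper, and exact sums replace the mean-field and MCMC evaluations discussed in the text. For this form, the gradient of the log marginal likelihood is the average of A under the full model minus its average under the posterior of h given the observed x, and the update moves theta along that gradient.

```python
import math

X_VALUES = (0, 1, 2)  # observable variable
H_VALUES = (0, 1)     # latent variable

def statistic(x, h):
    """Hypothetical statistic A(x, h) conjugate to the hyper-parameter theta."""
    return (x - 2 * h) ** 2

def joint_weights(theta):
    """Unnormalized weights exp(-theta * A(x, h)) of all joint states."""
    return {(x, h): math.exp(-theta * statistic(x, h))
            for x in X_VALUES for h in H_VALUES}

def marginal(theta):
    """Distribution of the observable after summing out the latent variable."""
    w = joint_weights(theta)
    z = sum(w.values())
    return {x: sum(w[(x, h)] for h in H_VALUES) / z for x in X_VALUES}

def model_average(theta):
    """Average of A over the full joint model ('free' phase)."""
    w = joint_weights(theta)
    z = sum(w.values())
    return sum(statistic(x, h) * w[(x, h)] for (x, h) in w) / z

def posterior_average(theta, x):
    """Average of A over the latent variable with x held fixed ('clamped' phase)."""
    w = {h: math.exp(-theta * statistic(x, h)) for h in H_VALUES}
    z = sum(w.values())
    return sum(statistic(x, h) * w[h] for h in H_VALUES) / z

def learn(theta0, data_dist, rate=0.5, steps=500):
    """Gradient ascent on the data-averaged log marginal likelihood."""
    theta = theta0
    for _ in range(steps):
        clamped = sum(p * posterior_average(theta, x) for x, p in data_dist.items())
        free = model_average(theta)
        theta += rate * (free - clamped)
    return theta

data_distribution = marginal(1.0)  # data generated at the 'true' value 1.0
print(learn(2.0, data_distribution))
```

The learning rate, the number of steps and the starting value are arbitrary choices for this toy case; in a realistic setting the two averages are the expensive part, which is where approximations such as the mean-field method or MCMC come in.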
VI-A Boltzmann-machine learning and its dynamics
Let us define as a conjugate statistic for the parameter . Then, the Boltzmann-machine learning equation is obtained as
Namely, we have