Super-resolution (SR) is an information processing technique that makes it possible to infer a spatially high-resolution (HR) image of a scene from corresponding multiple low-resolution (LR) images that are affected by warping, blurring, and noise. SR can be applied to a variety of images; e.g., still images extracted from several sequential video frames. SR needs the registration of LR images in addition to the image restoration of the registered LR images. Since the earliest work by Tsai and Huang , SR has been achieved using various methods [3, 10, 6, 8, 5, 2, 9, 4, 7] and good overviews of these methods are given in [11, 13, 12, 14, 15, 16]. Generally, SR is an ill-posed inverse problem because inverting the blur process without amplifying the effect of the noise is difficult 
. In other words, the degrees of freedom of the HR image and pixel-wise observation noise are always higher than the dimensionality of the observed LR images, so complete determination of an HR image is impossible. Therefore, the HR image is frequently inferred as the most preferable image within the framework of the probabilistic information processing, and we handle SR using this framework in this paper. The probabilistic information processing has three key features: 1) model, 2) objective function, and 3) optimization method. In the SR problem, the model includes the observation model and the prior model. The observation model consists of warping, blurring, downsampling, and noise models. The prior model, necessary for the Bayesian framework, mainly consists of an HR image prior, and sometimes includes both the hyperparameter prior for the HR image prior and the registration prior. The objective function evaluates how good or bad an estimator is. The estimator usually represents the inferred HR image, and sometimes includes auxiliary parameters; e.g., the registration parameters and edge information. The optimization method numerically maximizes/minimizes the objective function and determines the estimator. An optimization method is not necessary for simple problems in which an analytical exact solution can be obtained. In the probabilistic information processing, SR can be categorized according to these three key features.
To deal with warping, blurring, and downsampling, a linear transformation model is frequently used[3, 6, 8, 10]. Warping is usually limited with planar rotation and parallel translation. Blurring is defined by using a point spread function (PSF); a square or Gaussian type PSF is common. Downsampling denotes sampling from an HR image to construct an LR image. Downsampling sometimes includes anti-aliasing. Since these three transformations are linear, they can be combined into a single transformation matrix. As for the noise model, pixel-independent additive white Gaussian noise (AWGN) is usually employed.
The Bayesian framework, especially the HR image prior, is quite useful for SR. The HR image prior provides appropriate smoothness between neighboring pixel luminances. A common type of HR image prior imposes an L2-norm penalty on differences between horizontally and vertically adjacent pixel luminances (the first derivative). The L1-norm of the first derivative is sometimes used, and it has the advantage of robust inference against outliers. The total variation (TV) prior
employs the L1-norm of the gradient vector. The Huber prior is a mixture prior of L1- and L2-norms. The SAR model [2, 17, 9] employs the response of a two-dimensional Laplacian filter (the second derivative). The Gaussian process prior 
has neighboring pixels spread according to a Gaussian distribution. Besides the degree of smoothness between neighboring pixels, information regarding the discontinuity, or equivalently, the edges or line process, is also useful for inference. A common type of prior implementing edges is the compound Markov random field (MRF) prior that was introduced by Geman & Geman and is widely used [4, 6, 8]. With respect to the compound MRF [19, 20] prior, the normalizing constant, or equivalently, the partition function, is usually difficult to calculate because it has an exponential calculation cost with respect to the dimensionality of the line process. Recently, Kanemura et al. [6, 8] confusingly introduced a “causal” type of Gaussian MRF prior whose calculation cost is polynomial. We try to improve this prior in this paper.
The SR estimator should be derived from an objective function. As the objective function, a posterior distribution has been widely employed. Since the posterior distribution usually includes both the HR image and registration parameters, the joint maximum a posteriori (MAP) solution  is a suitable estimator for this objective function. Other than the joint MAP, the use of the marginalized maximum likelihood (ML) [3, 6] or marginalized MAP  has been proposed. Tipping et al.  and Kanemura et al. [6, 8] determine the registration parameters by using ML inference, where the HR image is marginalized out, and determine the HR image by using MAP inference. Pickup et al.  determines the HR image by using MAP inference, wherein the registration uncertainties are marginalized out, and assumes that the registration parameters are pre-registered by using standard registration techniques. Marginalized ML is also called type-II ML, evidence approximation, or empirical Bayes. Marginalized ML has no registration prior, unlike marginalized MAP. Pickup et al.  reported that marginalized MAP is superior to both joint MAP and marginalized ML. We evaluate the accuracy of SR methods in terms of the L2-norm (mean square error) -based peak signal-to-noise ratio (PSNR). Therefore, we think it is natural to employ PSNR as the objective function. For this objective function, posterior mean (PM) is a suitable estimator. The variational Bayes  approach  seems to approximately determine the PM of the HR image, although the authors assume some registration parameters are known and use point-estimate model parameters obtained by ML inference. To determine the exact PM of the HR image, all parameters other than the HR image should be marginalized out over the joint posterior distribution.
The type of optimization method to use is not as substantial a problem as the choice of model and objective function, but it is still important. Since almost all good estimators cannot be exactly determined because of difficult analytical integration or an exponential calculation cost, some approximation methods need to be introduced. Also, parameter tuning is necessary in many numerical optimization methods; e.g., of the initial value and the step-width settings in gradient methods. Specifically, in early work done on image restoration, an annealing method was used for the joint MAP solution [18, 22]. For marginalized ML and marginalized MAP solutions, the scaled conjugate gradients algorithm was used [3, 5]
. In recent work, the variational expectation-maximization (EM) algorithm has been applied, which includes the gradient method in the M step[6, 8]. The variational Bayes approach has also been applied . This method includes nested optimization of the majorization-minimization approach. This majorization-minimization approach seems to affect both the HR image prior and the estimator. Specifically, it modifies the TV prior to include a discontinuity parameter (called local spatial activity). In addition, this parameter is point-estimated when the HR image is inferred.
In this paper, we propose a new SR method that employs a “causal” Gaussian MRF prior and utilizes variational Bayes to calculate the optimal estimator, PM, with respect to the objective function of the L2-norm-based PSNR. This is a straightforward approach, but it was not proposed earlier possibly because an important limitation of variational Bayes is that a conjugate prior is needed. We solve this problem through simple Taylor approximations. In Section II, we define models, where we introduce a novel unified warping, blurring and downsampling model, an improved HR image prior, an improved hyperparameter prior, and a registration prior. In Section III, we employ PSNR as the objective function and derive the optimal estimator, PM, from this objective function. In Section IV, we determine the PM by using variational Bayes and Taylor approximations. In Section V, we evaluate the proposed method by comparing it with existing methods. We discuss the proposed method in Section VI and conclude in Section VII.
First, we define the gamma, Bernoulli, and Gaussian distributions used in this paper:
Here, is the gamma function, denotes the determinant of a given matrix, superscript denotes the transpose, is the real number field, and is the dimension of . The logistic function and Kullback-Leibler (KL) divergence from distributions to are respectively defined as
where the angle brackets denote the expectation of with respect to a distribution . Additionally, denotes the trace of a given matrix. denotes a diagonal matrix.
is an identity matrix of appropriate size.
is a zero vector or a zero matrix of appropriate size. All the vectors in this paper are column vectors. Thedenotes the L2-norm of a given vector. At this point, these variables have absolutely nothing to do with the variables that appear later.
Ii-B Observation Model
Our task is to estimate an HR grayscale image, , from the observed multiple LR grayscale images, . Images and are regarded as lexicographically stacked vectors. The number of pixels for each LR image, , is assumed to be less than that of the HR image, ; i.e., . We do this estimation using an SR technique whose resolution enhancement factor is . Although we define the range of a pixel luminance value as infinite, we use for black, for white, and values between and for gradual gray.
The image observation process is modeled as shown in Fig. 1; the HR image is geometrically warped, blurred, downsampled, and corrupted by noise to form the observed LR image :
or, more strictly,
is AWGN with precision (inverse variance). Here, is the transformation matrix that is simultaneously used for warping, blurring, and downsampling. It is defined as
where represents the extent of the summation (explained in the next paragraph), and the vectors and respectively denote the two-dimensional positions of the -th pixel of the original HR image and the -th pixel of the observed LR image. We define the center of each image as the origin and the size of each pixel is by . For example, regarding an HR image with pixels, each represents . and represent the warping parameters of the -th LR image: the rotational motion parameter and translational motion parameter. The Gaussian distribution in (3) represents a Gaussian PSF that defines the blur, and represents its precision parameter. In this paper, we assume also differs for each observed image. These transformation parameters are packed into , which is defined as
where subscripts and , respectively, denote horizontal and vertical positions on the image.
In previous works [3, 6, 8], the extent of was defined as the extent of the HR image. According to this definition, however, the shape of the PSF is no longer Gaussian. For example, at the corner of the HR image, the shape is not omnidirectional but limited in a way such as that of a quadrant. In this paper, the extent of is defined as infinite, and the luminance values outside the HR image are defined as (middle gray). This normalization term faithfully represents the Gaussian PSF. We also found that this normalization term is exactly given by using the elliptic theta function , and we can rewrite as
The elliptic theta function includes an infinite series, but it is easily determined numerically because the convergence is quite fast. In (II-B), the normalization term (the denominator of the right-hand side) seems to depend on because includes , but this is not true. Because the elliptic theta function is a periodic function with respect to the argument with period , and can only take discrete values with step size for the horizontal and vertical directions, the normalization term has the same value with respect to .
Ii-C HR Image Prior
Here, we introduce a “causal” Gaussian MRF prior for the HR image and additional latent variables. These latent variables are called the line process that controls the local correlation among pixel luminances. The introduction of the latent variables enables explicit expression of the possible discontinuity in the HR image. The line process,
, consists of binary variablesfor all adjacent pixel pairs and . Its size equals . We define the prior as
Here, the summation is taken over all pairs of adjacent pixels. The notation means that the -th and -th pixels are adjacent in the upward, downward, leftward, and rightward directions. The line process switches the local characteristics of the prior. It indicates whether two adjacent pixels take similar values or independent values. When , the -th and the -th pixels are strongly smoothed according to the quadratic penalty, whereas there is no smoothing when . The hyperparameter is an edge penalty parameter that prevents from excessively taking edges. Note that is restricted to positive values because a negative leads to a reward rather than a penalty for taking edges. is a smoothness parameter that prevents the differences in adjacent pixel luminances from becoming large, and is a contrast parameter that prevents from taking an improperly large absolute value. On the other hand, in previous works [6, 8], is assumed to be , which results in an improper normalizing constant (see Discussion). is the precision matrix of .
We have defined the introduced causal Gaussian MRF prior in the joint distribution form ofand , i.e., . We call such a model “causal” because seems to cause . The MRF model is defined as having the property
in this case; i.e., the conditional distribution of a random variable,, given all other variables, and , equals the conditional distribution of the random variable, , given its “neighboring” variables, and . If this conditional distribution is a Gaussian distribution, such an MRF is called a Gaussian MRF.
The “compound” MRF prior is usually defined in the form of the Gibbs distribution ,
which is based on some microstate energy function, or equivalently, a Hamiltonian, such as
In addition to the property of (13), a compound MRF also has the property of
whereas the introduced “causal” Gaussian MRF prior does not. Therefore, we do not call the introduced prior a “compound” MRF prior, even though (8) and (14) have similar forms. Furthermore, the introduced “causal” Gaussian MRF prior is a generative model, whereas the “compound” MRF is not. A generative model has the advantage of reducing the calculation cost (see Discussion).
Ii-D Hyperparameter Prior
Generally, prior distributions should be non-informative unless we have explicit reasons because an informative prior leads to heuristics. Actually, we define the prior distributions for the hyperparameters of the HR image prior to be as non-informative as possible:
For a gamma distribution, the number of effective prior observations in the Bayesian framework is equal to two times parameter. As shown in the Appendix, the number of observations for the hyperparameter is in this SR. Also, that for and is , and that for is . Therefore, the above settings – e.g., – are considered sufficiently non-informative. Superscript is added because we use these parameters as the initial values of variational Bayes later.
Ii-E Registration Prior
For the registration parameters including the blurring parameter, we also define the corresponding prior as
For the rotational motion parameter , the prior assumes degree (). This assumption is considered suitable for this SR task. Similarly, an assumption of pixels for translational motion parameters and is considered suitable. For blurring parameter , is taken to be the value equivalent to the anti-aliasing of the scale factor .
Iii Objective Function and Estimator
Iii-a Peak Signal-to-Noise Ratio (PSNR)
First, we confirm that the joint distribution of all random variables can now be explicitly given as
Once the joint distribution is obtained, we can derive all the marginal and conditional distributions; e.g., the posterior distribution and joint distribution of the HR and LR images .
One of the most commonly used evaluation functions of the inferred image is the L2-norm (mean square error) -based PSNR. It is defined as
where is the estimator of the HR image and is the true HR image. Since only LR images, , are available for the estimator, we sometimes explicitly express it as a function form, . Now, our objective function (functional) to be maximized regarding the estimator is defined as
This is because we prefer good estimator performance on average over various HR images and the corresponding LR images. Here, we assume that the occurrence rate of HR and LR images exactly coincides with the model we just introduced.
Iii-B Posterior Mean (PM)
Using the above objective function, we can explicitly derive the best estimator of the HR image as the PM,
Here, we used the well-known fact that the PM coincides with the minimum mean square error estimator in Bayesian framework. Note that needs marginalization of all parameters other than over . If the PM of the line process or other model parameters is necessary, it can also be determined in the same manner.
Iv Optimization Method
Iv-a Variational Bayes
Though we could derive the optimal estimator, we cannot obtain the analytical solutions of the posterior distribution and marginalized posterior distribution . Consequently, we have to rely on approximations. Here, we employ variational Bayes.
Variational Bayes  provides a trial distribution that approximates the true posterior. We impose a factorization assumption on the trial distribution,
Note that, at this moment, the distribution family of each factorized distribution is not limited. We identify the optimal trial distribution that minimizes the KL divergence between the trial and the true distributions as the best approximation of the true distribution:
Actually, the trial distribution that minimizes the KL divergence, not from to but from to coincides with the product of the exact marginal distributions as
but this minimization is difficult to calculate.
Under the factorization assumption of the trial distribution and the extremal condition of the KL divergence, each optimal trial distribution should satisfy the self-consistent equations,
Each factorized trial distribution is supposed to converge to the optimal distribution. Sometimes, some s are used instead of s for the distribution on the right-hand side of (31). It depends on the hierarchical structure of the model. Similarly, some s may not be necessary.
Iv-B Taylor Approximations
Although variational Bayes is a widely used general framework, its application is difficult in practice because it requires a conjugate prior. The prior distributions we have introduced are not conjugate priors. However, we have found that simple Taylor approximations make them conjugate and enable the analytical exact expectations in (31).
Here, to simplify the notation, we define the mean values of the latent variables , the hyper parameters , and the registration parameters over the trial distributions in the step number of the updates of variational Bayes as , , , , , .
Specifically, we use first-order Taylor approximations for three non-linear terms. is approximated around ,
Similarly, is approximated around ,
We also use a similar approximation around . In addition, is approximated around ,
Iv-C Update Equations
For (30) and (31), we update those distributions as follows. First, we compute using . Second, we compute using . Finally, we compute using and using . Here, we simply compute only the parameters of those distributions because we can compute the expectations in (31) analytically by using Taylor approximations in (32), (IV-B), and (IV-B). Specific update equations are described in the Appendix.
For the initial parameters of the trial distributions of and , we use non-informative values,
For the initial parameters for , , , and , we use the same values as their prior’s values.
We obtain the well-approximated PM of as . Realistically, instead of , we use when the following convergence conditions hold for and each ,
where we defined as the scaling constant.
V Experimental Results
|[dB]||(proposed)||(vs bilinear)||(vs Kanemura)||(vs Babacan)|
The proposed method was evaluated using five gray-scale images with a size of pixels, as shown in Fig. 2. From each image, images with a size of pixels were created by using (1), (2) with the settings of the parameters , , and as the following. The resolution enhancement factor was 4. The transformation parameter was randomly created according to the prior distribution in (19). The noise level parameter was set for signal-to-noise ratios (SNR) of , , and dB for each image. Samples of the created images are shown in Fig. 3.
lists the quantitative results compared to those from the methods of bilinear interpolation, Kanemuraet al. , and Babacan et al. . Note that we added a slight modification to these methods because they employ slightly different models. For example, the original method  assumes the blurring parameter is known, so we set as the mean value of the true distribution for this method. Also, we introduced a strong prior for in the Kanemura method 
in contrast to the original method, because this parameter sometimes becomes negative. We evaluated the results with regard to the expectation and the standard deviation of the improvement in signal-to-noise ratio (ISNR) overexperiments on each image and for each SNR. ISNR is the relative PSNR defined as
where is the true HR image, is the image estimated by the proposed method, and is the image estimated by the compared method. A higher ISNR value means better improvement of the estimate against the estimate of the compared method. We see that the ISNRs of the proposed method were mostly higher than those of the other methods, except for the comparison with the Babacan’s method in Pepper image.
Table II lists the root mean square errors (RMSE) of our method and the other methods. To evaluate the estimated registration parameters, we took the RMSEs over experiments ( experiments 5 images) for each noise level. Of course, a lower RMSE value means a better estimate. We see that the RMSEs of the proposed method were mostly higher than those of the other methods.
With regard to the observation model, we used a linear transformation and AWGN. The use of the linear transformation model is advantageous since an arbitrary transformation matrix can be employed because of the Taylor approximation. The transformation matrix can be constructed by multiplying three matrices: the warping, blurring, and downsampling matrices . A disadvantage of this is that sub-pixel errors might accumulate. We prefer matrix construction via a continuous function . We improved the construction by introducing an elliptic theta function for the normalizing constant in (II-B). This normalizing constant provides fair pixel weights for both marginal and central areas of the HR image and faithfully represents the Gaussian PSF.
With regard to the HR image prior, we used a causal type of prior, which was first introduced by Kanemura et al. [6, 8]. The microstate energy function, or equivalently, the Hamiltonian, -based compound MRF prior of (14), offers the advantage of easy construction, but it usually has an exponential calculation cost, , for the normalizing constant or, equivalently, the partition function, and this is an obstacle to direct calculation of the PM solution. The MAP solution has been used in work elsewhere because it is not affected by the normalizing constant. In contrast, the introduced causal type of prior of (8) has only a polynomial calculation cost , which enables us to successfully apply the variational Bayes method to this problem.
With regard to the hyperparameter priors, we also improved the existing method. As the edge penalty parameter , Kanemura et al.  implicitly assumed , which leads to a negative and consequently results in an edge-strewn image. We assumed by setting its prior according to a gamma distribution, resulting in an appropriate inference. As the smoothness parameter , they practically fixed the value of with a strongly informative prior. We chose a non-informative prior for . We show the box and whisker plot of the PM for each hyperparameter over experiments on each image under SNRdB noise in Fig. 5. As can be seen, the inferred value of the PM of showed wide variation, with an approximately 10-fold maximum-to-minimum ratio, depending on the original image. This result can be interpreted as meaning it is worth inferring in each HR image. Furthermore, and respectively showed approximately 2-fold and 4-fold ranges of variation. Regarding the contrast parameter , they assumed , which leads to , and this results in an improper normalizing constant. While we assume , which leads to a proper normalizing constant, we can consequently take the term of into account in the update equations of the variational Bayes.
With regard to the prior distribution for the blurring parameter , we used a Gaussian distribution even though is a positive real number. This is because we selected a simpler expression. We tried using the prior of the gamma distribution as , but the improvement was small. One disadvantage of this model is that a non-informative setting for this prior may lead to a nonsense result where the inferred is negative. Moreover, we employed a somewhat informative prior for . This is because the blurring parameter and smoothness hyperparameter are somewhat complementary. This means that simultaneous estimation of and is difficult. Tipping et al.  and Kanemura et al.  fixed , and Babacan et al.  fixed .
With regard to the estimator, we logically derived the optimal estimator PM from the objective function of the L2-norm-based PSNR. The widely used joint MAP estimator can be considered the optimal estimator for the all-or-none type objective function,
where is the Dirac delta or Kronecker delta function. Generally, this type of objective function is nonsensical for continuous variables because it is measure zero. If all the random variables in the posterior distribution are discrete, or if we can assume some smoothness of the posterior distribution, a joint MAP solution will have meaning. Instead of the L2-norm-based objective function of PSNR, the L1-norm (mean absolute error) -based PSNR is sometimes employed. In such cases, the median of the posterior distribution is generally the optimal estimator. In the case of the marginalized ML, or equivalently, type-II ML or empirical Bayes, for example, the registration parameters and other hyperparameters are firstly inferred as:
If these parameters have priors, such a method is called marginalized MAP. The HR image and sometimes the edge information are then inferred to as MAP,
or PM. For such a two-step inference, it is difficult to calculate back the objective function.
With regard to the Taylor approximation for the transformation matrix , we used the first-order approximation in (32) because it is more stable than the second-order approximation. This first-order approximation was proposed by Villena et al. . The second-order approximation was proposed by Pickup et al. , and they obtained good results. We also tried the second-order approximation, but it sometimes made the algorithm unstable because it sometimes failed to produce a positive definite matrix for the covariance matrix .
With regard to the Taylor approximation for and , we introduced the first-order approximation around and , respectively, in (IV-B) and (IV-B). Note that the Taylor expansion not with respect to , but with respect to is our key idea to solve the conjugate prior problem. Indeed, we could successfully derive the terms originating from in update equations ((52), (60), and (62) in Appendix). Kanemura et al. [6, 8] ignored the term of because of the high calculation cost, and this would result in less accurate inference. As for , we implicitly assumed that is not a binary vector but a continuous vector and did the differentiation. This assumption is based on (12). If we make another assumption – i.e., replacement of with in (12) – (12) has the same meaning, but the result of the Taylor approximation will differ from the current form.
With regard to the experimental results, the proposed method outperforms the other methods in terms of the ISNR for most images and noise levels. Moreover, its estimation of the registration parameters was more accurate than the other methods were for most conditions. Therefore, we conclude the proposed method is on the whole superior to the other methods. Compared with bilinear interpolation and Kanemura’s method, the superiority of the proposed method was clear. Compared with the Babacan’s method, the superiority of the proposed method was rather slight. Especially, in the case of the Pepper image in dB noise, the porposed method was worse than the Babacan’s method. This inferiority is considered to be caused by unstable estimation of and , where Babacan’s method fixed the value of to the true expected value in our implementation. Intuitively, the Pepper image is smoother than the other images and has fewer edges. Therefore, this feature is considered to be less preferable for complementary parameters of and .
With regard to the calculation cost, the proposed algorithm requires . This calculation cost order is given by two matrix inversions: in (55) and in (52) and (62) (see Appendix). We found that a simple approximation such as considering all the off-diagonal elements to be zero reduces the calculation time but obviously degrades accuracy. We hope to solve this problem in our future work.
In this paper, we proposed a Bayesian image super-resolution (SR) method with a causal Gaussian Markov random field (MRF) prior. We improved existing models with respect to three points: 1) the combined transformation model through a preferable normalization term using the elliptic theta function, 2) the causal Gaussian MRF model through introduction of a contrast parameter , which provides an effective normalizing constant including , and 3) the hyperparameter prior model through application of a gamma distribution for the edge penalty parameter , which prevents an unfavorable edge-strewn image. We then logically derived the optimal estimator, that is, not the joint maximum a posteriori (MAP) or marginalized maximum likelihood (ML) but the posterior mean (PM), from the objective function of the L2-norm (mean square error) -based peak signal-to-noise ratio (PSNR). The estimator is numerically determined by using variational Bayes. We solved the conjugate prior problem in variational Bayes by introducing three Taylor approximations. Other than these approximations, we did not use any approximations such as ignoring the term . Experimental results showed that the proposed method is mostly superior to existing methods in accuracy.
Here, we show the details of the variational Bayes’ update equations in Section IV-C.
The mean values of the hyperparameters over the trial distributions are given by
The update equation of is given as