 # Linear-Time Algorithm in Bayesian Image Denoising based on Gaussian Markov Random Field

In this paper, we consider Bayesian image denoising based on a Gaussian Markov random field (GMRF) model, for which we propose a new algorithm. Our method can solve Bayesian image denoising problems, including hyperparameter estimation, in O(n)-time, where n is the number of pixels in a given image. From the perspective of the order of the computational time, this is a state-of-the-art algorithm for the present problem setting. Moreover, the results of our numerical experiments show that our method is in fact effective in practice.


## I Introduction

Bayesian image processing Geman and Geman (1984) based on a probabilistic graphical model has a long and rich history Blake et al. (2011). In Bayesian image processing, one constructs a posterior distribution and then infers restored images based on the posterior distribution. The posterior distribution is derived from a prior distribution that captures the statistical properties of the images. One of the major challenges of Bayesian image processing is the construction of an effective prior for the images. For this purpose, a Gaussian Markov random field (GMRF) model (or Gaussian graphical model) is a possible choice. Because a GMRF is a multi-dimensional Gaussian distribution, its mathematical treatment is tractable. GMRFs have also been applied in various research fields other than Bayesian image processing, e.g., traffic reconstruction Kataoka et al. (2014), sparse modeling Banerjee et al. (2007), and earth science Kuwatani et al. (2014a, b).

In this paper, we focus on Bayesian image denoising based on GMRFs Nishimori (2000); Tanaka (2002). The procedure of Bayesian image denoising is divided into two stages: a hyperparameter-estimation stage and a restoring stage. In the hyperparameter-estimation stage, one estimates the optimal values of the hyperparameters in the posterior distribution for a given degraded image by using the expectation-maximization (EM) algorithm (or maximum marginal-likelihood estimation), while, in the restoring stage, one infers the restored image based on the posterior distribution with the optimal hyperparameters by using maximum a posteriori (MAP) estimation. However, the processes in the two stages have a computational problem: they require the inverse operation of the covariance matrix, the computational time of which is $O(n^3)$, where $n$ is the number of pixels in the degraded image. Hence, in order to reduce the computational time, some approximations are employed. In the usual setting in Bayesian image denoising, the graph structure of the GMRF model is a square grid graph according to the configuration of the pixels. By adding the periodic boundary condition (PBC) to the square grid graph, the authors of Nishimori (2000); Tanaka (2002) constructed $O(n^2)$-time algorithms. We refer to this approximation as the torus approximation, because a square grid graph with the PBC is a torus graph. This approximation allows the graph Laplacian of the GMRF model to be diagonalized by using the discrete Fourier transformation (DFT), and the diagonalization reduces the computational time to $O(n^2)$. More recently, the time was further reduced to $O(n \log n)$ by using the fast Fourier transformation (FFT) Morin (2010).

Recently, it was shown that the graph Laplacian of a square-grid GMRF model without the PBC can be diagonalized by using the discrete cosine transformation (DCT) Fracastoro and Magli (2015); Fracastoro et al. (2017); Katakami et al. (2017). By using the DCT, one can diagonalize the graph Laplacian without the torus approximation. Therefore, by combining the diagonalization based on the DCT with the FFT method proposed in Ref. Morin (2010), one can construct an $O(n \log n)$-time algorithm without the torus approximation. The details of this algorithm are presented in Appendix D.

In this paper, we define a model for Bayesian image denoising based on GMRFs. The defined GMRF model has a more general form than the GMRF models presented in many previous studies, and it includes them as special cases. For our GMRF model, we propose an $O(n)$-time algorithm for practical Bayesian image denoising that uses the DCT and the mean-field approximation. Since the mean-field approximation is equivalent to the exact computation in the present setting (see Appendix A), the results obtained by our method are equivalent to those obtained by exact computation. Our algorithm is a state-of-the-art algorithm from the perspective of the order of the computational time. It is noteworthy that, in the present problem setting, one cannot create a sublinear-time algorithm, i.e., an algorithm whose computational time is less than $O(n)$, because at least $O(n)$-time computation is needed just to output the restored image. In Tab. 1, the computational times of our method and the previous methods are shown.

The remainder of this paper is organized as follows. The model definition and the frameworks of Bayesian image denoising and of the EM algorithm for hyperparameter estimation are presented in Sect. II. The proposed method is shown in Sect. III. In Sect. IV, we demonstrate the validity of our method through numerical experiments. Finally, conclusions are given in Sect. V.

## II Framework of Bayesian Image Denoising

Consider the problem of digital image denoising described below. Given an original image composed of $n$ pixels, a degraded image of the same size as the original one is generated by adding additive white Gaussian noise (AWGN) to the original one. Suppose that we independently obtain $K$ degraded images generated from the same stochastic process, where $K \geq 1$. The goal of Bayesian image denoising is to infer the original image from the given data, i.e., the $K$ degraded images. We denote the original image by the vector $\mathbf{x} = \{x_i\}$ ($x_i$ expresses the intensity of the $i$-th pixel), where the image is vectorized by raster scanning (or row-major scanning) on the image, and $n$ is the number of pixels in the image; we denote the $k$-th degraded image by the vector $\mathbf{y}^{(k)}$. Note that the degraded images are also vectorized by the same procedure, i.e., raster scanning, as the original one. The framework of the presented Bayesian image denoising is illustrated in Fig. 1.

Figure 1: Framework of Bayesian image denoising considered in this paper. $P_{\mathrm{pri}}(\mathbf{x}\mid\theta)$, $P_{\mathrm{L}}(Y_K\mid\mathbf{x},\sigma^2)$, and $P_{\mathrm{post}}(\mathbf{x}\mid Y_K,\Theta)$ are the prior distribution (Eq. (1)), the likelihood (Eq. (8)), and the posterior distribution (Eq. (9)), respectively. $Y_K$ is the set of $K$ degraded images.

We define the prior distribution of the original image by using a GMRF model, defined on a square grid graph with a free boundary condition, and express it as

$$P_{\mathrm{pri}}(\mathbf{x}\mid\theta) := \frac{1}{Z_{\mathrm{pri}}(\theta)}\exp\left(-E_{\mathrm{pri}}(\mathbf{x};\theta)\right) \quad (1)$$

with the energy function

$$E_{\mathrm{pri}}(\mathbf{x};\theta) := -b\sum_{i\in V}x_i + \frac{\lambda}{2}\sum_{i\in V}x_i^2 + \frac{\alpha}{2}\sum_{\{i,j\}\in E}(x_i - x_j)^2, \quad (2)$$

where $V$ and $E$ are the sets of vertices and undirected edges in $G(V,E)$, respectively, and $\theta := \{b, \lambda, \alpha\}$ is the set of the hyperparameters in the prior distribution: $b$ controls the brightness of the image, $\lambda$ controls the variance of the intensities of pixels, and $\alpha$ controls the smoothness of the image. $Z_{\mathrm{pri}}(\theta)$ is the partition function defined by

$$Z_{\mathrm{pri}}(\theta) := \int_{-\infty}^{\infty}\exp\left(-E_{\mathrm{pri}}(\mathbf{x};\theta)\right)\,d\mathbf{x}. \quad (3)$$

The energy function of the prior distribution in Eq. (2) can be rewritten as

$$E_{\mathrm{pri}}(\mathbf{x};\theta) = -\mathbf{b}^{\mathrm{T}}\mathbf{x} + \frac{1}{2}\mathbf{x}^{\mathrm{T}}\mathbf{S}_{\mathrm{pri}}\mathbf{x}, \quad (4)$$

where $\mathbf{b}$ is the $n$-dimensional vector whose elements are all $b$, and

$$\mathbf{S}_{\mathrm{pri}} := \lambda\mathbf{I}_n + \alpha\boldsymbol{\Lambda} \quad (5)$$

is the precision matrix (the inverse of the covariance matrix) of the prior distribution, where $\mathbf{I}_n$ is the $n$-dimensional identity matrix and $\boldsymbol{\Lambda}$ is the graph Laplacian of $G(V,E)$:

$$\Lambda_{i,j} = \begin{cases} |\partial(i)| & i = j \\ -1 & \{i,j\}\in E \\ 0 & \mathrm{otherwise}, \end{cases} \quad (6)$$

where $\partial(i)$ is the set of vertices connected to vertex $i$. By using Eq. (4) and the Gaussian integral, Eq. (3) yields

$$Z_{\mathrm{pri}}(\theta) = \exp\left(\frac{1}{2}\mathbf{b}^{\mathrm{T}}\mathbf{S}_{\mathrm{pri}}^{-1}\mathbf{b}\right)\sqrt{(2\pi)^n\det\mathbf{S}_{\mathrm{pri}}^{-1}}. \quad (7)$$

It is noteworthy that the hyperparameter $\lambda$ is important for the mathematical treatment of the prior distribution, because, when $\lambda = 0$, the precision matrix $\mathbf{S}_{\mathrm{pri}}$ is not positive definite and therefore Eq. (1) is not a well-defined probability distribution.

Since we assume AWGN, the likelihood, i.e., the generating process of the degraded images, is defined by

$$P_{\mathrm{L}}(Y_K\mid\mathbf{x},\sigma^2) := \prod_{k=1}^{K}\prod_{i\in V}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\left(x_i - y_i^{(k)}\right)^2}{2\sigma^2}\right), \quad (8)$$

where $\sigma^2$ is the variance of the AWGN.

By using Eqs. (1) and (8) and Bayes' theorem,

$$P_{\mathrm{post}}(\mathbf{x}\mid Y_K,\Theta) \propto P_{\mathrm{L}}(Y_K\mid\mathbf{x},\sigma^2)\,P_{\mathrm{pri}}(\mathbf{x}\mid\theta),$$

where $\Theta := \{b, \lambda, \alpha, \sigma^2\}$, the posterior distribution is expressed as

$$P_{\mathrm{post}}(\mathbf{x}\mid Y_K,\Theta) := \frac{1}{Z_{\mathrm{post}}(\Theta)}\exp\left(-\frac{1}{2\sigma^2}\sum_{k=1}^{K}\left\|\mathbf{x} - \mathbf{y}^{(k)}\right\|^2 - E_{\mathrm{pri}}(\mathbf{x};b,\lambda,\alpha)\right), \quad (9)$$

where

$$Z_{\mathrm{post}}(\Theta) := \int_{-\infty}^{\infty}\exp\left(-\frac{1}{2\sigma^2}\sum_{k=1}^{K}\left\|\mathbf{x} - \mathbf{y}^{(k)}\right\|^2 - E_{\mathrm{pri}}(\mathbf{x};b,\lambda,\alpha)\right)d\mathbf{x}. \quad (10)$$

Here, $\|\cdot\|$ is the Euclidean norm of the assigned vector. Similarly to the prior distribution, the posterior distribution and its partition function in Eqs. (9) and (10) can be expressed as

$$P_{\mathrm{post}}(\mathbf{x}\mid Y_K,\Theta) = \frac{1}{Z_{\mathrm{post}}(\Theta)}\exp\left(-\frac{1}{2\sigma^2}\sum_{k=1}^{K}\left\|\mathbf{y}^{(k)}\right\|^2 + \mathbf{c}^{\mathrm{T}}\mathbf{x} - \frac{1}{2}\mathbf{x}^{\mathrm{T}}\mathbf{S}_{\mathrm{post}}\mathbf{x}\right) \quad (11)$$

and

$$Z_{\mathrm{post}}(\Theta) = \exp\left(-\frac{1}{2\sigma^2}\sum_{k=1}^{K}\left\|\mathbf{y}^{(k)}\right\|^2 + \frac{1}{2}\mathbf{c}^{\mathrm{T}}\mathbf{S}_{\mathrm{post}}^{-1}\mathbf{c}\right)\sqrt{(2\pi)^n\det\mathbf{S}_{\mathrm{post}}^{-1}}, \quad (12)$$

respectively, where

$$\mathbf{c} := \mathbf{b} + \frac{K}{\sigma^2}\hat{\mathbf{y}}, \quad (13)$$

where

$$\hat{\mathbf{y}} := \frac{1}{K}\sum_{k=1}^{K}\mathbf{y}^{(k)} \quad (14)$$

is the average image of the $K$ degraded images, and

$$\mathbf{S}_{\mathrm{post}} := \left(\lambda + \frac{K}{\sigma^2}\right)\mathbf{I}_n + \alpha\boldsymbol{\Lambda} \quad (15)$$

is the precision matrix of the posterior distribution.

In previous work Nishimori (2000); Tanaka (2002); Morin (2010), $b$ and $\lambda$ were treated as fixed constants, whereas, in this study, they are treated as controllable parameters, which are optimized through the EM algorithm as explained later. Moreover, the previous work considered only the case where the number of degraded images is just one, i.e., the case of $K = 1$, while the present formulation allows the treatment of multiple degraded images. For these reasons, our model is a generalization that includes the models presented in the previous studies Nishimori (2000); Tanaka (2002); Morin (2010).

### II.1 Maximum a Posteriori Estimation

In Bayesian image denoising, we have to infer the original image by using only the degraded images $Y_K$. In MAP estimation, we regard $\mathbf{m}(Y_K,\Theta)$, which maximizes the posterior distribution $P_{\mathrm{post}}(\mathbf{x}\mid Y_K,\Theta)$,

$$\mathbf{m}(Y_K,\Theta) := \mathop{\mathrm{arg\,max}}_{\mathbf{x}}\,P_{\mathrm{post}}(\mathbf{x}\mid Y_K,\Theta), \quad (16)$$

as the most probable estimate of the original image. Since the posterior distribution in Eq. (9) is a multivariate Gaussian distribution, the MAP estimate coincides with the mean vector of the posterior distribution:

$$\mathbf{m}(Y_K,\Theta) = \int_{-\infty}^{\infty}\mathbf{x}\,P_{\mathrm{post}}(\mathbf{x}\mid Y_K,\Theta)\,d\mathbf{x}.$$

Naively, the order of the computational time of computing the mean vector of the posterior distribution in Eq. (16) is $O(n^3)$, because it includes the inverting operation for the precision matrix: $\mathbf{m}(Y_K,\Theta) = \mathbf{S}_{\mathrm{post}}^{-1}\mathbf{c}$.

We can obtain linear equations among the means $\{m_i(Y_K,\Theta)\}$ by using the mean-field approximation Wainwright and Jordan (2008) or loopy belief propagation Weiss and Freeman (2001). Both methods lead to the same expression:

$$m_i(Y_K,\Theta) = \frac{1}{\lambda + \alpha|\partial(i)| + K/\sigma^2}\left(b + \frac{K}{\sigma^2}\hat{y}_i + \alpha\sum_{j\in\partial(i)}m_j(Y_K,\Theta)\right). \quad (17)$$

Eq. (17) can be solved by using a successive iteration method. It is known that, in a GMRF model, the mean vector evaluated by the mean-field approximation (or loopy belief propagation) is equivalent to the exact one Weiss and Freeman (2001); Wainwright and Jordan (2008); Shental et al. (2008); see Appendix A. Therefore, the solution to Eq. (17) is equivalent to the exact mean vector of the posterior distribution in Eq. (16). The order of the computational time of solving Eq. (17) is $O(n)$, because the second sum on the right-hand side of Eq. (17) includes at most four terms.
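As an illustration, the successive iteration for Eq. (17) can be sketched as follows. This is not the parallelized C++/OpenMP implementation used in this paper; the Jacobi-style sweep schedule, grid size, and hyperparameter values in the accompanying check are our own choices.

```python
import numpy as np

def mean_field_map(y_bar, b, lam, alpha, sig2, K, n_iter=500):
    """Solve the mean-field equation (Eq. (17)) for the posterior mean m
    on a v x v grid with free boundaries, by successive substitution."""
    v = y_bar.shape[0]
    deg = np.full((v, v), 4.0)               # |∂(i)|: 4 inside, 3 on edges, 2 at corners
    deg[0] -= 1; deg[-1] -= 1; deg[:, 0] -= 1; deg[:, -1] -= 1
    m = y_bar.copy()                         # warm start from the average image
    for _ in range(n_iter):
        s = np.zeros_like(m)                 # sum of neighbouring means
        s[1:] += m[:-1]; s[:-1] += m[1:]
        s[:, 1:] += m[:, :-1]; s[:, :-1] += m[:, 1:]
        m = (b + (K / sig2) * y_bar + alpha * s) / (lam + alpha * deg + K / sig2)
    return m
```

Because the update of each $m_i$ touches at most four neighbours, one sweep costs $O(n)$ time.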

From the above, we see that, for a given $Y_K$ and a fixed $\Theta$, Bayesian image denoising can be performed in $O(n)$-time. Obviously, however, the quality of the result depends on $\Theta$, and the stage of optimizing $\Theta$ still remains. The EM algorithm, described in the following section, is one of the most popular methods for this optimization.

### II.2 Expectation-Maximization Algorithm

In the EM algorithm, for a fixed $\Theta_{\mathrm{old}}$, we have to maximize the $Q$-function,

$$Q(\Theta;\Theta_{\mathrm{old}}) := \int_{-\infty}^{\infty}P_{\mathrm{post}}(\mathbf{x}\mid Y_K,\Theta_{\mathrm{old}})\ln P_{\mathrm{joint}}(\mathbf{x},Y_K\mid\Theta)\,d\mathbf{x}, \quad (18)$$

with respect to $\Theta$, where $P_{\mathrm{joint}}(\mathbf{x},Y_K\mid\Theta) := P_{\mathrm{L}}(Y_K\mid\mathbf{x},\sigma^2)\,P_{\mathrm{pri}}(\mathbf{x}\mid\theta)$ is the joint distribution over $\mathbf{x}$ and $Y_K$. The gradients of the $Q$-function with respect to the parameters in $\Theta$ are

$$\nabla_b Q(\Theta;\Theta_{\mathrm{old}}) = \sum_{i\in V}\mathbb{E}_{\mathrm{post}}[x_i\mid\Theta_{\mathrm{old}}] - \sum_{i\in V}\mathbb{E}_{\mathrm{pri}}[x_i\mid\theta], \quad (19)$$

$$\nabla_\lambda Q(\Theta;\Theta_{\mathrm{old}}) = -\frac{1}{2}\mathbb{E}_{\mathrm{post}}\left[\|\mathbf{x}\|^2\,\middle|\,\Theta_{\mathrm{old}}\right] + \frac{1}{2}\mathbb{E}_{\mathrm{pri}}\left[\|\mathbf{x}\|^2\,\middle|\,\theta\right], \quad (20)$$

$$\nabla_{\sigma^2} Q(\Theta;\Theta_{\mathrm{old}}) = \frac{1}{2\sigma^4}\sum_{k=1}^{K}\mathbb{E}_{\mathrm{post}}\left[\left\|\mathbf{x} - \mathbf{y}^{(k)}\right\|^2\,\middle|\,\Theta_{\mathrm{old}}\right] - \frac{nK}{2\sigma^2}, \quad (21)$$

and

$$\nabla_\alpha Q(\Theta;\Theta_{\mathrm{old}}) = -\frac{1}{2}\sum_{\{i,j\}\in E}\mathbb{E}_{\mathrm{post}}\left[(x_i - x_j)^2\,\middle|\,\Theta_{\mathrm{old}}\right] + \frac{1}{2}\sum_{\{i,j\}\in E}\mathbb{E}_{\mathrm{pri}}\left[(x_i - x_j)^2\,\middle|\,\theta\right], \quad (22)$$

where $\Theta_{\mathrm{old}}$ is the current estimate of the hyperparameters, and $\mathbb{E}_{\mathrm{pri}}[\cdots\mid\theta]$ and $\mathbb{E}_{\mathrm{post}}[\cdots\mid\Theta_{\mathrm{old}}]$ are the expectation values with respect to the prior distribution $P_{\mathrm{pri}}(\mathbf{x}\mid\theta)$ and the posterior distribution $P_{\mathrm{post}}(\mathbf{x}\mid Y_K,\Theta_{\mathrm{old}})$, respectively. Using a gradient ascent method with the gradients in Eqs. (19)–(22), we maximize the $Q$-function in Eq. (18). Naively, the computation of these gradients requires $O(n^3)$-time, which is rather expensive, because they include the inverting operation for the precision matrices $\mathbf{S}_{\mathrm{pri}}$ and $\mathbf{S}_{\mathrm{post}}$.

The method presented in the next section allows the EM algorithm to be executed in linear time. Therefore, given degraded images, we can perform Bayesian image denoising, including the optimization of $\Theta$ by the EM algorithm, in $O(n)$-time.

## III Expectation-Maximization Algorithm in Linear Time

We now consider the free energies of the prior and the posterior distributions,

$$F_{\mathrm{pri}}(\theta) := -\ln Z_{\mathrm{pri}}(\theta), \quad (23)$$

$$F_{\mathrm{post}}(\Theta) := -\ln Z_{\mathrm{post}}(\Theta). \quad (24)$$

By using these free energies, the gradients in Eqs. (19)–(22) are rewritten as

$$\nabla_b Q(\Theta;\Theta_{\mathrm{old}}) = \frac{\partial F_{\mathrm{pri}}(\theta)}{\partial b} - \frac{\partial F_{\mathrm{post}}(\Theta_{\mathrm{old}})}{\partial b_{\mathrm{old}}}, \quad (25)$$

$$\nabla_\lambda Q(\Theta;\Theta_{\mathrm{old}}) = \frac{\partial F_{\mathrm{pri}}(\theta)}{\partial \lambda} - \frac{\partial F_{\mathrm{post}}(\Theta_{\mathrm{old}})}{\partial \lambda_{\mathrm{old}}}, \quad (26)$$

$$\nabla_{\sigma^2} Q(\Theta;\Theta_{\mathrm{old}}) = -\frac{\partial F_{\mathrm{post}}(\Theta_{\mathrm{old}})}{\partial \sigma^2_{\mathrm{old}}} - \frac{nK}{2\sigma^2}, \quad (27)$$

and

$$\nabla_\alpha Q(\Theta;\Theta_{\mathrm{old}}) = \frac{\partial F_{\mathrm{pri}}(\theta)}{\partial \alpha} - \frac{\partial F_{\mathrm{post}}(\Theta_{\mathrm{old}})}{\partial \alpha_{\mathrm{old}}}. \quad (28)$$

In the following sections, we analyze the two free energies, $F_{\mathrm{pri}}(\theta)$ and $F_{\mathrm{post}}(\Theta)$, and their derivatives.

### III.1 Prior Free Energy and Its Derivatives

From Eq. (7), the free energy of the prior distribution in Eq. (23) is expressed as

$$F_{\mathrm{pri}}(\theta) = -\frac{1}{2}\mathbf{b}^{\mathrm{T}}\mathbf{S}_{\mathrm{pri}}^{-1}\mathbf{b} + \frac{1}{2}\ln\det\mathbf{S}_{\mathrm{pri}} - \frac{n}{2}\ln(2\pi). \quad (29)$$

Since the GMRF model is defined on a square grid graph with the free boundary condition, its graph Laplacian can be diagonalized as Fracastoro and Magli (2015); Fracastoro et al. (2017); Katakami et al. (2017)

$$\boldsymbol{\Lambda} = \mathbf{U}\boldsymbol{\Phi}\mathbf{U}^{\mathrm{T}}. \quad (30)$$

Here, $\mathbf{U}$ is the orthogonal matrix defined by

$$\mathbf{U} := \mathbf{K}\otimes\mathbf{K}, \quad (31)$$

where $\mathbf{K}$ is the $v \times v$ (type-II) inverse discrete cosine transform (IDCT) matrix, with $v := \sqrt{n}$ being the linear size of the $v \times v$ image:

$$K_{i,j} := \begin{cases} \dfrac{1}{\sqrt{v}} & j = 1 \\ \sqrt{\dfrac{2}{v}}\cos\left\{\dfrac{\pi(j-1)}{v}\left(i - \dfrac{1}{2}\right)\right\} & j \neq 1. \end{cases} \quad (32)$$

$\boldsymbol{\Phi}$ is the diagonal matrix, the elements of which are the eigenvalues of $\boldsymbol{\Lambda}$, and its diagonal vector, $\{\Phi_i\}$, is obtained by the row-major-order vectorization of the matrix $\psi$ defined as

$$\psi_{i,j} := 4\sin^2\left(\frac{\pi(i-1)}{2v}\right) + 4\sin^2\left(\frac{\pi(j-1)}{2v}\right), \quad (33)$$

i.e., $\Phi_{v(i-1)+j} = \psi_{i,j}$. By using Eq. (30), Eq. (29) is rewritten as

$$F_{\mathrm{pri}}(\theta) = -\frac{nb^2}{2\lambda} + \frac{1}{2}\sum_{i\in V}\ln(\lambda + \alpha\Phi_i) - \frac{n}{2}\ln(2\pi). \quad (34)$$

The detailed derivation of this equation is shown in Appendix B.
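The diagonalization in Eqs. (30)–(33) can be checked numerically on a small grid. In the sketch below (grid size chosen arbitrarily for illustration), the 2-D grid Laplacian is built as a Kronecker sum of 1-D path-graph Laplacians, and $\mathbf{K}$ is built directly from Eq. (32):

```python
import numpy as np

v = 6
# 1-D path-graph Laplacian with free boundary
L1 = 2 * np.eye(v) - np.eye(v, k=1) - np.eye(v, k=-1)
L1[0, 0] = L1[-1, -1] = 1
# type-II IDCT matrix of Eq. (32) (indices i, j are 1-based)
i = np.arange(1, v + 1)[:, None]
j = np.arange(1, v + 1)[None, :]
Kmat = np.where(j == 1, 1 / np.sqrt(v),
                np.sqrt(2 / v) * np.cos(np.pi * (j - 1) * (i - 0.5) / v))
# 2-D grid Laplacian as a Kronecker sum, and U = K ⊗ K (Eq. (31))
Lam = np.kron(L1, np.eye(v)) + np.kron(np.eye(v), L1)
U = np.kron(Kmat, Kmat)
# eigenvalues psi_{i,j} of Eq. (33), vectorized in row-major order
psi = (4 * np.sin(np.pi * (i - 1) / (2 * v)) ** 2
       + 4 * np.sin(np.pi * (j - 1) / (2 * v)) ** 2)
phi = psi.ravel()
assert np.allclose(U @ U.T, np.eye(v * v))       # U is orthogonal
assert np.allclose(U.T @ Lam @ U, np.diag(phi))  # Eq. (30)
```

The columns of $\mathbf{K}$ are the DCT-II basis vectors, which are exactly the eigenvectors of the free-boundary path-graph Laplacian; the Kronecker structure then gives the 2-D eigenvalues as sums of 1-D ones.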

From Eq. (34),

$$\frac{\partial F_{\mathrm{pri}}(\theta)}{\partial b} = -\frac{nb}{\lambda}, \quad (35)$$

$$\frac{\partial F_{\mathrm{pri}}(\theta)}{\partial \lambda} = \frac{nb^2}{2\lambda^2} + \frac{1}{2}\sum_{i\in V}\frac{1}{\lambda + \alpha\Phi_i}, \quad (36)$$

$$\frac{\partial F_{\mathrm{pri}}(\theta)}{\partial \alpha} = \frac{1}{2}\sum_{i\in V}\frac{\Phi_i}{\lambda + \alpha\Phi_i} \quad (37)$$

are obtained. We can compute the right-hand sides of Eqs. (35)–(37) in $O(n)$-time.

### III.2 Posterior Free Energy and Its Derivatives

From Eqs. (12) and (24), the free energy of the posterior distribution is expressed as

$$F_{\mathrm{post}}(\Theta) = \frac{1}{2\sigma^2}\sum_{k=1}^{K}\left\|\mathbf{y}^{(k)}\right\|^2 - \frac{1}{2}\mathbf{c}^{\mathrm{T}}\mathbf{S}_{\mathrm{post}}^{-1}\mathbf{c} + \frac{1}{2}\ln\det\mathbf{S}_{\mathrm{post}} - \frac{n}{2}\ln(2\pi). \quad (38)$$

Because $\mathbf{S}_{\mathrm{post}} = \mathbf{U}\left((\lambda + K/\sigma^2)\mathbf{I}_n + \alpha\boldsymbol{\Phi}\right)\mathbf{U}^{\mathrm{T}}$, it can be rewritten as

$$F_{\mathrm{post}}(\Theta) = \frac{1}{2\sigma^2}\sum_{k=1}^{K}\left\|\mathbf{y}^{(k)}\right\|^2 - \frac{1}{2}\mathbf{c}^{\mathrm{T}}\mathbf{S}_{\mathrm{post}}^{-1}\mathbf{c} + \frac{1}{2}\sum_{i\in V}\ln\left(\lambda + \frac{K}{\sigma^2} + \alpha\Phi_i\right) - \frac{n}{2}\ln(2\pi). \quad (39)$$

From the derivation presented in Appendix C, we obtain

$$\frac{\partial F_{\mathrm{post}}(\Theta)}{\partial b} = -\frac{nb + n(K/\sigma^2)\hat{y}_{\mathrm{ave}}}{\lambda + K/\sigma^2}, \quad (40)$$

where $\hat{y}_{\mathrm{ave}} := n^{-1}\sum_{i\in V}\hat{y}_i$ is the average intensity over $\hat{\mathbf{y}}$. By using the two equations

$$\frac{\partial \mathbf{S}_{\mathrm{post}}^{-1}}{\partial\gamma} = -\mathbf{S}_{\mathrm{post}}^{-1}\frac{\partial \mathbf{S}_{\mathrm{post}}}{\partial\gamma}\mathbf{S}_{\mathrm{post}}^{-1}, \qquad \frac{\partial \ln\det\mathbf{S}_{\mathrm{post}}}{\partial\gamma} = \mathrm{Tr}\left(\mathbf{S}_{\mathrm{post}}^{-1}\frac{\partial \mathbf{S}_{\mathrm{post}}}{\partial\gamma}\right)$$

for $\gamma = \lambda$, $\sigma^2$, and $\alpha$, we obtain the derivatives of Eq. (39) as

$$\frac{\partial F_{\mathrm{post}}(\Theta)}{\partial \lambda} = \frac{1}{2}\left\|\mathbf{m}(Y_K,\Theta)\right\|^2 + \frac{1}{2}\sum_{i\in V}\frac{1}{\lambda + K/\sigma^2 + \alpha\Phi_i}, \quad (41)$$

$$\frac{\partial F_{\mathrm{post}}(\Theta)}{\partial \sigma^2} = -\frac{1}{2\sigma^4}\sum_{k=1}^{K}\left\|\mathbf{y}^{(k)} - \mathbf{m}(Y_K,\Theta)\right\|^2 - \frac{K}{2\sigma^4}\sum_{i\in V}\frac{1}{\lambda + K/\sigma^2 + \alpha\Phi_i}, \quad (42)$$

$$\frac{\partial F_{\mathrm{post}}(\Theta)}{\partial \alpha} = \frac{1}{2}\sum_{\{i,j\}\in E}\left(m_i(Y_K,\Theta) - m_j(Y_K,\Theta)\right)^2 + \frac{1}{2}\sum_{i\in V}\frac{\Phi_i}{\lambda + K/\sigma^2 + \alpha\Phi_i}. \quad (43)$$

Since Eq. (17) can be solved in $O(n)$-time, we can also compute the right-hand sides of Eqs. (41)–(43) in $O(n)$-time. Note that, since $|E| < 2n$ on a square grid graph, one can compute the first term on the right-hand side of Eq. (43) in $O(n)$-time.
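The trace terms in Eqs. (41)–(43) reduce to the spectral sums above because $\mathbf{S}_{\mathrm{post}}$ and $\boldsymbol{\Lambda}$ share the eigenvector matrix $\mathbf{U}$. A small numerical check (grid size and hyperparameter values are arbitrary):

```python
import numpy as np

v, lam, alpha, sig2, K = 5, 0.3, 0.7, 1.5, 2
n = v * v
L1 = 2 * np.eye(v) - np.eye(v, k=1) - np.eye(v, k=-1)
L1[0, 0] = L1[-1, -1] = 1
Lam = np.kron(L1, np.eye(v)) + np.kron(np.eye(v), L1)
s1 = 4 * np.sin(np.pi * np.arange(v) / (2 * v)) ** 2
phi = np.add.outer(s1, s1).ravel()               # eigenvalues of the grid Laplacian
S_inv = np.linalg.inv((lam + K / sig2) * np.eye(n) + alpha * Lam)
d = lam + K / sig2 + alpha * phi                 # posterior spectrum
assert np.isclose(np.trace(S_inv), (1 / d).sum())          # trace term in Eqs. (41), (42)
assert np.isclose(np.trace(S_inv @ Lam), (phi / d).sum())  # trace term in Eq. (43)
```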

### III.3 Proposed Method

From the results in Sects. III.1 and III.2, the gradients of the $Q$-function in Eqs. (25)–(28) can be rewritten as follows. From Eqs. (25), (35), and (40), we obtain

$$\nabla_b Q(\Theta;\Theta_{\mathrm{old}}) = \frac{nb_{\mathrm{old}} + n(K/\sigma^2_{\mathrm{old}})\hat{y}_{\mathrm{ave}}}{\lambda_{\mathrm{old}} + K/\sigma^2_{\mathrm{old}}} - \frac{nb}{\lambda}. \quad (44)$$

From Eqs. (26), (36), and (41), we obtain

$$\nabla_\lambda Q(\Theta;\Theta_{\mathrm{old}}) = -\frac{1}{2}\left\|\mathbf{m}(Y_K,\Theta_{\mathrm{old}})\right\|^2 - \frac{1}{2}\sum_{i\in V}\frac{1}{\lambda_{\mathrm{old}} + K/\sigma^2_{\mathrm{old}} + \alpha_{\mathrm{old}}\Phi_i} + \frac{nb^2}{2\lambda^2} + \frac{1}{2}\sum_{i\in V}\frac{1}{\lambda + \alpha\Phi_i}. \quad (45)$$

From Eqs. (27) and (42), we obtain

$$\nabla_{\sigma^2} Q(\Theta;\Theta_{\mathrm{old}}) = \frac{1}{2\sigma^4}\sum_{k=1}^{K}\left\|\mathbf{m}(Y_K,\Theta_{\mathrm{old}}) - \mathbf{y}^{(k)}\right\|^2 + \frac{K}{2\sigma^4}\sum_{i\in V}\frac{1}{\lambda_{\mathrm{old}} + K/\sigma^2_{\mathrm{old}} + \alpha_{\mathrm{old}}\Phi_i} - \frac{nK}{2\sigma^2}. \quad (46)$$

Finally, from Eqs. (28), (37), and (43), we obtain

$$\nabla_\alpha Q(\Theta;\Theta_{\mathrm{old}}) = -\frac{1}{2}\sum_{\{i,j\}\in E}\left(m_i(Y_K,\Theta_{\mathrm{old}}) - m_j(Y_K,\Theta_{\mathrm{old}})\right)^2 - \frac{1}{2}\sum_{i\in V}\frac{\Phi_i}{\lambda_{\mathrm{old}} + K/\sigma^2_{\mathrm{old}} + \alpha_{\mathrm{old}}\Phi_i} + \frac{1}{2}\sum_{i\in V}\frac{\Phi_i}{\lambda + \alpha\Phi_i}. \quad (47)$$

Note that $\mathbf{m}(Y_K,\Theta_{\mathrm{old}})$ in Eqs. (45)–(47) is the mean vector of $P_{\mathrm{post}}(\mathbf{x}\mid Y_K,\Theta_{\mathrm{old}})$ and is the solution to the mean-field equation in Eq. (17). Because the gradients are zero at the maximum point of $Q(\Theta;\Theta_{\mathrm{old}})$, Eq. (46) becomes

$$\sigma^2 = \frac{1}{nK}\sum_{k=1}^{K}\left\|\mathbf{m}(Y_K,\Theta_{\mathrm{old}}) - \mathbf{y}^{(k)}\right\|^2 + \frac{1}{n}\sum_{i\in V}\frac{1}{\lambda_{\mathrm{old}} + K/\sigma^2_{\mathrm{old}} + \alpha_{\mathrm{old}}\Phi_i} \quad (48)$$

at that point. Since $\mathbf{m}(Y_K,\Theta_{\mathrm{old}})$ can be obtained in linear time, we can also evaluate the right-hand sides of Eqs. (45)–(48) in linear time.

The pseudo-code of the proposed procedure is shown in Algorithm 1. The EM algorithm contains a double-loop structure: the outer loop (Steps 4 to 17) and the inner loop (Steps 11 to 15), referred to as the M-step. In the M-step, we maximize $Q(\Theta;\Theta_{\mathrm{old}})$ with respect to $\Theta$ for a fixed $\Theta_{\mathrm{old}}$ by a gradient ascent method with a fixed number of iterations, where each of $b$, $\lambda$, and $\alpha$ has its own step rate and the gradients are presented in Eqs. (44), (45), and (47). In Steps 5 to 8, we compute the mean vector of the posterior distribution by solving the mean-field equation in Eq. (17) with a successive iteration method with warm restarts. In practice, we can stop this iteration after a few sweeps; in the experiments in the following section, it was stopped after just one sweep. In Step 10, $b$, $\lambda$, and $\alpha$ are initialized by $b_{\mathrm{old}}$, $\lambda_{\mathrm{old}}$, and $\alpha_{\mathrm{old}}$.
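The double-loop procedure above can be sketched compactly in Python (the paper's implementation is parallelized C++/OpenMP). The initial values, step rates `eta`, fixed iteration counts, and the simple positivity guards below are assumptions made for this sketch, not the settings of Algorithm 1:

```python
import numpy as np

def em_denoise_sketch(ys, n_em=50, n_mf=1, eta=(1e-4, 1e-4, 1e-6)):
    """Illustrative EM sketch for a stack ys of K degraded v x v images.
    Hyperparameter initialization and step rates are illustrative assumptions."""
    K, v, _ = ys.shape
    n = v * v
    s1 = 4 * np.sin(np.pi * np.arange(v) / (2 * v)) ** 2
    phi = np.add.outer(s1, s1)            # grid-Laplacian eigenvalues (Eq. (33))
    y_bar = ys.mean(axis=0)               # average image \hat{y} (Eq. (14))
    deg = np.full((v, v), 4.0)            # |∂(i)| with free boundaries
    deg[0] -= 1; deg[-1] -= 1; deg[:, 0] -= 1; deg[:, -1] -= 1
    b, lam, alpha, sig2 = 0.0, 1.0, 0.1, np.var(ys)
    m = y_bar.copy()
    for _ in range(n_em):
        # E-step: mean-field sweeps of Eq. (17), warm-started from the previous m
        for _ in range(n_mf):
            s = np.zeros_like(m)
            s[1:] += m[:-1]; s[:-1] += m[1:]
            s[:, 1:] += m[:, :-1]; s[:, :-1] += m[:, 1:]
            m = (b + (K / sig2) * y_bar + alpha * s) / (lam + alpha * deg + K / sig2)
        d_old = lam + K / sig2 + alpha * phi      # posterior spectrum at Theta_old
        grad_sq = ((m[1:] - m[:-1]) ** 2).sum() + ((m[:, 1:] - m[:, :-1]) ** 2).sum()
        # M-step: one gradient-ascent update using Eqs. (44), (45), and (47)
        gb = (n * b + n * (K / sig2) * y_bar.mean()) / (lam + K / sig2) - n * b / lam
        gl = (-0.5 * (m ** 2).sum() - 0.5 * (1 / d_old).sum()
              + 0.5 * n * b ** 2 / lam ** 2 + 0.5 * (1 / (lam + alpha * phi)).sum())
        ga = (-0.5 * grad_sq - 0.5 * (phi / d_old).sum()
              + 0.5 * (phi / (lam + alpha * phi)).sum())
        # sigma^2 from the fixed-point condition, Eq. (48)
        sig2 = ((ys - m) ** 2).sum() / (n * K) + (1 / d_old).sum() / n
        b += eta[0] * gb
        lam = max(lam + eta[1] * gl, 1e-8)        # keep lambda positive
        alpha = max(alpha + eta[2] * ga, 0.0)
    return m, (b, lam, alpha, sig2)
```

Every step inside the loops is a sum over pixels or edges, so one EM iteration costs $O(n)$ time, in line with the complexity claim of this section.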

Since $\Phi_1 = 0$ (the graph Laplacian always has a zero eigenvalue), the third term in Eq. (47) is approximated as

$$\frac{1}{2}\sum_{i\in V}\frac{\Phi_i}{\lambda + \alpha\Phi_i} = \frac{1}{2}\sum_{i\in V\setminus\{1\}}\frac{\Phi_i}{\lambda + \alpha\Phi_i} \approx \frac{n-1}{2\alpha}, \quad (49)$$

when $\lambda \ll \alpha\Phi_i$ for all $i \neq 1$. When $\lambda \approx 0$ and $\nabla_\alpha Q(\Theta;\Theta_{\mathrm{old}}) = 0$, Eq. (47) is therefore approximated as

$$\frac{1}{\alpha} \approx \frac{1}{n-1}\left(\sum_{\{i,j\}\in E}\left(m_i(Y_K,\Theta_{\mathrm{old}}) - m_j(Y_K,\Theta_{\mathrm{old}})\right)^2 + \sum_{i\in V}\frac{\Phi_i}{\lambda_{\mathrm{old}} + K/\sigma^2_{\mathrm{old}} + \alpha_{\mathrm{old}}\Phi_i}\right). \quad (50)$$

Hence, if the value of $\lambda$ at the maximum point of $Q(\Theta;\Theta_{\mathrm{old}})$ is quite small, the value of $\alpha$ obtained at that point is close to Eq. (50); initializing $\alpha$ by using Eq. (50) in Step 10 of Algorithm 1 could therefore facilitate faster convergence when the optimal $\lambda$ obtained by the EM algorithm is expected to be quite small. In fact, in all of the numerical experiments in Sect. IV, the optimal values of $\lambda$ obtained by the EM algorithm were quite small. We thus used this initialization method for $\alpha$ in the following numerical experiments.

When all the gradients in Eqs. (44), (45), and (47) are zero and $\Theta = \Theta_{\mathrm{old}}$, namely, when the EM algorithm converges, Eq. (44) yields $b = \lambda\hat{y}_{\mathrm{ave}}$. This means that $b$ vanishes during the EM algorithm when $\hat{y}_{\mathrm{ave}}$, the average intensity over the average image of the degraded images, is zero.

## IV Numerical Experiments

Figure 2: Original 8-bit gray-scale images, the sizes of which are (a) 256×256, (b) 512×512, and (c) 1024×1024.

We demonstrate our method by experiments using the images shown in Fig. 2. The parameter settings of the following experiments were as follows. The step rates in the M-step were fixed constants; our Bayesian system is quite sensitive to one of the hyperparameters, and we therefore set its step rate smaller than the others for the stability of the algorithm. The initial values of the hyperparameters in $\Theta$ and the convergence threshold in Step 17 of Algorithm 1 were also fixed in advance. The number of iterations in the M-step was one Inoue and Tanaka (2007). The degraded images were centered (i.e., the average over all pixels in the degraded images was shifted to zero) in the preprocessing. The maximum iteration number of the EM algorithm was 100.

The methods used in the following experiments, including our method, were implemented in C++ with the OpenMP library to parallelize the for-loops (for example, the summations over $V$ and $E$ in Eqs. (45), (47), and (48) and the for-loop over $k$ in Step 6 of Algorithm 1), and they were run on Microsoft Windows 8 (64 bit) with an Intel(R) Core(TM) i7-4930K CPU (3.4 GHz) and 32 GB of RAM.

### IV.1 Computation Time

We compared the computation time of our method with that of the