Analysis Operator Learning and Its Application to Image Reconstruction

Exploiting a priori known structural information lies at the core of many image reconstruction methods that can be stated as inverse problems. The synthesis model, which assumes that images can be decomposed into a linear combination of very few atoms of some dictionary, is now a well-established tool for the design of image reconstruction algorithms. An interesting alternative is the analysis model, where the signal is multiplied by an analysis operator and the outcome is assumed to be sparse. This approach has only recently gained increasing interest. The quality of reconstruction methods based on an analysis model depends strongly on the choice of a suitable operator. In this work, we present an algorithm for learning an analysis operator from training images. Our method is based on an ℓ_p-norm minimization on the set of full rank matrices with normalized columns. We carefully introduce the employed conjugate gradient method on manifolds and explain the underlying geometry of the constraints. Moreover, we compare our approach to state-of-the-art methods for image denoising, inpainting, and single image super-resolution. Our numerical results show competitive performance of our general approach in all presented applications compared to the specialized state-of-the-art techniques.


I Introduction

I-A Problem Description

Linear inverse problems are ubiquitous in the field of image processing. Prominent examples are image denoising [1], inpainting [2], super-resolution [3], or image reconstruction from few indirect measurements as in Compressive Sensing [4]. Basically, in all these problems the goal is to reconstruct an unknown image s ∈ ℝⁿ as accurately as possible from a set of indirect and possibly corrupted measurements y ∈ ℝᵐ with m ≤ n, see [5] for a detailed introduction to inverse problems. Formally, this measurement process can be written as

 y=As+e, (1)

where the vector e ∈ ℝᵐ models sampling errors and noise, and A ∈ ℝ^{m×n} is the measurement matrix modeling the sampling process. In many cases, reconstructing s by simply inverting Equation (1) is ill-posed, because either the exact measurement process, and hence A, is unknown as in blind image deconvolution, or the number of observations is much smaller than the dimension of the signal, which is the case in Compressive Sensing or image inpainting. To overcome the ill-posedness and to stabilize the solution, prior knowledge or assumptions about the general statistics of images can be exploited.
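As a concrete illustration of the measurement model (1), the following sketch (a hypothetical inpainting-style setup, not code from the paper) builds a selection matrix A that observes only m < n pixels and adds noise; the resulting system has fewer equations than unknowns and is therefore ill-posed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64                        # signal (image patch) dimension
s = rng.standard_normal(n)    # hypothetical ground-truth signal

# Inpainting-style measurement: A keeps a random subset of m < n pixels.
m = 40
idx = rng.choice(n, size=m, replace=False)
A = np.zeros((m, n))
A[np.arange(m), idx] = 1.0

e = 0.01 * rng.standard_normal(m)    # additive noise, Equation (1)
y = A @ s + e

print(A.shape, y.shape)   # fewer observations than unknowns
```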

I-B Synthesis Model and Dictionary Learning

One assumption that has proven to be successful in image reconstruction, cf. [6], is that natural images admit a sparse representation x ∈ ℝᵈ over some dictionary D ∈ ℝ^{n×d} with n ≤ d. A vector x is called sparse when most of its coefficients are equal to zero or small in magnitude. When s admits a sparse representation over D, it can be expressed as a linear combination of only very few columns of the dictionary, called atoms, which reads as

 s=Dx. (2)

For n < d, the dictionary is said to be overcomplete or redundant.

Now, using the knowledge that (2) allows a sparse solution, an estimate of the original signal in (1) can be obtained from the measurements y by first solving

 x⋆ = arg min_{x∈ℝᵈ} g(x) subject to ∥ADx − y∥₂² ≤ ϵ, (3)

and afterwards synthesizing the signal from the computed sparse coefficients via s⋆ = Dx⋆. Therein, g: ℝᵈ → ℝ is a function that promotes or measures sparsity, and ϵ ∈ ℝ₊ is an estimated upper bound on the noise power ∥e∥₂². Common choices for g include the ℓ_p-norm

 ∥v∥pp:=∑i|vi|p, (4)

with 0 ≤ p ≤ 1, and differentiable approximations of (4). As the signal is synthesized from the sparse coefficients, the reconstruction model (3) is called the synthesis reconstruction model [7].
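For illustration, a minimal sketch of the ℓ_p measure (4): the two example vectors below have equal ℓ₂ energy, but for p < 1 the sparse one scores lower, which is exactly why such measures favor sparse coefficient vectors:

```python
import numpy as np

def lp_p(v, p):
    """l_p "norm" raised to the p-th power, Equation (4): sum_i |v_i|^p."""
    return np.sum(np.abs(v) ** p)

sparse = np.array([3.0, 0.0, 0.0, 0.0])
dense = np.array([1.5, 1.5, 1.5, 1.5])

# Both vectors have the same l2 energy (9.0), yet for p < 1 the sparse
# vector receives the smaller sparsity score.
print(lp_p(sparse, 0.5), lp_p(dense, 0.5))
```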

To find the minimizer of Problem (3), various algorithms based on convex or non-convex optimization, greedy pursuit methods, or Bayesian frameworks exist that may employ different choices of g. For a broad overview of such algorithms, we refer the interested reader to [8]. What all these algorithms have in common is that their performance regarding the reconstruction quality severely depends on an appropriately chosen dictionary D. Ideally, one seeks a dictionary over which s can be represented most accurately with a coefficient vector x that is as sparse as possible. Basically, dictionaries can be assigned to two major classes: analytic dictionaries and learned dictionaries.

Analytic dictionaries are built on mathematical models of a general type of signal they should represent, e.g. natural images. Popular examples include Wavelets [9], Bandlets [10], and Curvelets [11] among several others, or a concatenation of various such bases/dictionaries. They offer the advantages of low computational complexity and of being universally applicable to a wide set of signals. However, this universality comes at the cost of not providing the optimally sparse representation for more specific classes of signals, e.g. face images.

It is now well known that signals belonging to a specific class can be represented with fewer coefficients over a dictionary that has been learned using a representative training set, than over analytic dictionaries. This is desirable for various image reconstruction applications as it readily improves their performance and accuracy [12, 13, 14]. Basically, the goal is to find a dictionary over which a training set admits a maximally sparse representation. In contrast to analytic dictionaries, which can be applied globally to an entire image, learned dictionaries are small dense matrices that have to be applied locally to small image patches. Hence, the training set consists of small patches extracted from some example images. This restriction to patches mainly arises from limited memory, and limited computational resources.

Roughly speaking, starting from some initial dictionary the learning algorithms iteratively update the atoms of the dictionary, such that the sparsity of the training set is increased. This procedure is often performed via block-coordinate relaxation, which alternates between finding the sparsest representation of the training set while fixing the atoms, and optimizing the atoms that most accurately reproduce the training set using the previously determined sparse representation. Three conceptually different approaches for learning a dictionary became well established, which are probabilistic ones like [15], clustering based ones such as K-SVD [16], and recent approaches which aim at learning dictionaries with specific matrix structures that allow fast computations like [17]. For a comprehensive overview of dictionary learning techniques see [18].

I-C Analysis Model

An alternative to the synthesis model (3) for reconstructing a signal is to solve

 s⋆ = arg min_{s∈ℝⁿ} g(Ωs) subject to ∥As − y∥₂² ≤ ϵ, (5)

which is known as the analysis model [7]. Therein, Ω ∈ ℝ^{k×n} with k ≥ n is called the analysis operator, and the analyzed vector Ωs ∈ ℝᵏ is assumed to be sparse, where sparsity is again measured via an appropriate function g. In contrast to the synthesis model, where a signal is fully described by the non-zero elements of x, in the analysis model the zero elements of the analyzed vector Ωs describe the subspace containing the signal. To emphasize this difference, the term cosparsity has been introduced in [19], which simply counts the number of zero elements of Ωs. As the sparsity in the synthesis model depends on the chosen dictionary, the cosparsity of an analyzed signal depends on the choice of the analysis operator Ω.

Different analysis operators proposed in the literature include the fused Lasso [20], the translation invariant wavelet transform [21], and, probably best known, the finite difference operator, which is closely related to the total-variation [22]. They all have shown very good performance when used within the analysis model for solving diverse inverse problems in imaging. The question is: can the performance of analysis based signal reconstruction be improved when a learned analysis operator is applied instead of a predefined one, as is the case for the synthesis model, where learned dictionaries outperform analytic dictionaries? In [7], it has been discussed that the two models differ significantly, and that the naïve way of learning a dictionary and simply employing its transpose or its pseudo-inverse as the learned analysis operator fails. Hence, different algorithms are required for analysis operator learning.
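To make the cosparsity notion concrete, here is a small sketch (not from the paper) that uses a 1-D finite difference operator as the analysis operator; on a piecewise constant signal, the analyzed vector is zero everywhere except at the jump:

```python
import numpy as np

def fd_operator(n):
    """1-D forward finite-difference analysis operator of size (n-1) x n."""
    O = np.zeros((n - 1, n))
    for i in range(n - 1):
        O[i, i], O[i, i + 1] = -1.0, 1.0
    return O

def cosparsity(Omega, s, tol=1e-10):
    """Number of (near-)zero entries of the analyzed vector Omega @ s."""
    return int(np.sum(np.abs(Omega @ s) < tol))

s = np.array([2.0, 2.0, 2.0, 5.0, 5.0, 5.0])   # piecewise-constant signal
Omega = fd_operator(len(s))
print(cosparsity(Omega, s))   # 4: all differences vanish except the one jump
```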

I-D Contributions

In this work, we introduce a new algorithm based on geometric optimization for learning a patch based analysis operator from a set of training samples, which we name GOAL (GeOmetric Analysis operator Learning). The method relies on a minimization problem, which is carefully motivated in Section II-B. Therein, we also discuss the question of what is a suitable analysis operator for image reconstruction, and how to counteract overfitting the operator to a subset of the training samples. An efficient geometric conjugate gradient method on the so-called oblique manifold is proposed in Section III for learning the analysis operator. Furthermore, in Section IV we explain how to apply the local patch based analysis operator to achieve global reconstruction results. Section V sheds some light on the influence of the parameters required by GOAL and how to select them, and compares our method to other analysis operator learning techniques. The quality of the operator learned by GOAL on natural image patches is further investigated in terms of image denoising, inpainting, and single image super-resolution. The numerical results show the broad and effective applicability of our general approach.

I-E Notations

Matrices are written as capital calligraphic letters like X, column vectors are denoted by boldfaced small letters, e.g. x, whereas scalars are either capital or small letters like k. By x_i we denote the i-th element of the vector x, and x_{ij} denotes the element in the i-th row and j-th column of a matrix X. The vector x_{:,j} denotes the j-th column of X, whereas x_{i,:}⊤ denotes the transpose of the i-th row of X. By E_{ij}, we denote a matrix whose entry in the i-th row and j-th column is equal to one, and all others are zero. I_n denotes the identity matrix of dimension n × n, 0 denotes the zero matrix of appropriate dimension, and ddiag(X) is the diagonal matrix whose entries on the diagonal are those of X. By ∥X∥_F² we denote the squared Frobenius norm of a matrix X, tr(X) is the trace of X, and rk(X) denotes the rank.

II Analysis Operator Learning

II-A Prior Art

The topic of analysis operator learning has only recently started to be investigated, and only little prior work exists. In the following, we briefly review analysis operator learning methods that are applicable to image processing tasks.

Given a set of M training samples {s_i}_{i=1}^{M} ⊂ ℝⁿ, the goal of analysis operator learning is to find a matrix Ω ∈ ℝ^{k×n} with k ≥ n which leads to a maximally cosparse representation of each training sample. As mentioned in Subsection I-B, the training samples are distinctive vectorized image patches extracted from a set of example images. Let S = [s₁, …, s_M] ∈ ℝ^{n×M} be a matrix whose columns are the training samples; then the problem is to find

 Ω⋆ = arg min_Ω G(ΩS), (6)

where Ω is subject to some constraints, and G is some function that measures the sparsity of the matrix ΩS. In [23], an algorithm is proposed in which the rows of the analysis operator are found sequentially by identifying directions that are orthogonal to a subset of the training samples. Starting from a randomly initialized vector ω, a candidate row is found by first computing the inner products of ω with the entire training set, followed by extracting the reduced training set Ŝ of samples whose inner product with ω is smaller than a threshold. Thereafter, ω is updated to be the eigenvector corresponding to the smallest eigenvalue of ŜŜ⊤. This procedure is iterated several times until a convergence criterion is met. If the determined candidate vector is sufficiently distinctive from the already found ones, it is added to Ω as a new row; otherwise it is discarded. This process is repeated until the desired number of rows has been found.
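The sequential search described above can be sketched roughly as follows. This is a simplified illustration of the procedure in [23] under assumed details (a fixed threshold and a fallback to the full set when no sample qualifies), not the authors' implementation:

```python
import numpy as np

def candidate_row(S, iters=10, thresh=0.2, seed=0):
    """Sketch of the sequential search in [23]: refine a row w until it is
    near-orthogonal to a subset of the training samples (columns of S)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(S.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        # reduced training set: samples nearly orthogonal to the current w
        R = S[:, np.abs(w @ S) < thresh]
        if R.shape[1] == 0:            # no candidates yet: keep the full set
            R = S
        # eigenvector of R R^T belonging to the smallest eigenvalue
        _, vecs = np.linalg.eigh(R @ R.T)
        w = vecs[:, 0]                 # eigh sorts eigenvalues ascending
    return w
```

If the training samples all lie in a proper subspace, the returned row is (up to sign) a normal vector of that subspace, i.e. exactly orthogonal to every sample.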

An adaption of the widely known K-SVD dictionary learning algorithm to the problem of analysis operator learning is presented in [24]. As in the original K-SVD algorithm, the ℓ₀-pseudo-norm is employed as the sparsifying function, and the target cosparsity is required as an input to the algorithm. The arising optimization problem is solved by alternating between a sparse coding stage over each training sample while fixing Ω, using an ordinary analysis pursuit method, and updating the analysis operator using the optimized training set. Then, each row of Ω is updated in a similar way as described in the previous paragraph for the method of [23]. Interestingly, the operators learned on piecewise constant image patches by [23] and [24] closely mimic the finite difference operator.

In [25], the authors use the ℓ₁-norm as the sparsity promoting function and suggest a constrained optimization technique that utilizes a projected subgradient method for iteratively solving (6). To exclude the trivial solution, the set of possible analysis operators is restricted to the set of Uniform Normalized Tight Frames, i.e. matrices with uniform row norm and orthonormal columns. The authors state that this algorithm has the limitation of requiring noiseless training samples whose analyzed vectors are exactly cosparse.

To overcome this restriction, the same authors propose an extension of this algorithm that simultaneously learns the analysis operator and denoises the training samples, cf. [26]. This is achieved by alternating between updating the analysis operator via the projected subgradient algorithm and denoising the samples using an Augmented Lagrangian method. Therein, the authors state that their results for image denoising using the learned operator are only slightly worse compared to employing the commonly used finite difference operator.

An interesting idea related to the analysis model, called Fields-of-Experts (FoE), has been proposed in [27]. The method relies on learning high-order Markov Random Field image priors with potential functions extending over large pixel neighborhoods, i.e. overlapping image patches. Motivated by a probabilistic model, they use the student-t distribution of several linear filter responses as the potential function, where the filters, which correspond to atoms from an analysis operator point of view, have been learned from training patches. Compared to our work and the methods explained above, their learned operator used in the experiments is underdetermined, i.e. k < n, the algorithm only works for small patches for computational reasons, and the atoms are learned independently, while in contrast GOAL updates the analysis operator as a whole.

Ii-B Motivation of Our Approach

In the quest for designing an analysis operator learning algorithm, the natural question arises: what is a good analysis operator for our needs? Clearly, given a signal s that belongs to a certain signal class, the aim is to find an Ω such that Ωs is as sparse as possible. This motivates minimizing the expected sparsity E[g(Ωs)]. All approaches presented in Subsection II-A can be explained in this way, i.e. for their respective sparsity measure they aim at learning an Ω that minimizes the empirical mean of the sparsity over all randomly drawn training samples. This, however, does not necessarily yield the optimal Ω if the purpose is to reconstruct several signals belonging to a diverse class, e.g. natural image patches. The reason is that even if the expected sparsity is low, it may happen with high probability that some realizations of this signal class cannot be represented in a sparse way, i.e. that for a given upper bound, the probability of exceeding it takes an intolerable value, cf. Figure 1.

The algorithm presented here aims at minimizing the empirical expectation of a sparsifying function g for all training samples s_i, while additionally keeping the empirical variance moderate. In other words, we try to avoid that the analyzed vectors of many similar training samples become very sparse and consequently prevent Ω from being adapted to the remaining ones. For image processing, this is of particular interest if the training patches are chosen randomly from natural images, because there is a high probability of collecting a large subset of very similar patches, e.g. homogeneous regions, that bias the learning process.

Concretely, we want to find an that minimizes both the squared empirical mean

 ḡ² = ( (1/M) ∑_i g(Ωs_i) )² (7)

and the empirical variance

 s² = (1/M) ∑_i ( g(Ωs_i) − ḡ )² (8)

of the sparsity of the analyzed vectors. We achieve this by minimizing the sum of both, which is readily given by

 ḡ² + s² = (1/M) ∑_i g(Ωs_i)². (9)
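The identity behind Equation (9), squared mean plus (biased) variance equals the mean of squares, can be checked numerically in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)
g_vals = rng.random(1000)          # stand-ins for the values g(Omega s_i)

mean_sq = np.mean(g_vals) ** 2                     # Equation (7)
var = np.mean((g_vals - np.mean(g_vals)) ** 2)     # Equation (8)

# Equation (9): squared mean + biased variance = mean of squares
print(np.isclose(mean_sq + var, np.mean(g_vals ** 2)))   # True
```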

Using g(v) = (1/p)∥v∥ₚᵖ, and introducing the factor 1/2, the function we employ reads as

 J_p(V) := (1/2M) ∑_{j=1}^{M} ( (1/p) ∑_{i=1}^{k} |v_ij|ᵖ )² = (1/2M) ∑_{j=1}^{M} ( (1/p) ∥v_{:,j}∥ₚᵖ )², (10)

with V := ΩS ∈ ℝ^{k×M} and 0 < p ≤ 1.

Certainly, without additional prior assumptions on Ω, the useless solution Ω = 0 is the global minimizer of Problem (6). To avoid the trivial solution, and for other reasons explained later in this section, we regularize the problem by imposing the following three constraints on Ω.

(i) The rows of Ω have unit Euclidean norm, i.e. ∥ω_{i,:}∥₂ = 1 for all i.

(ii) The analysis operator Ω has full rank, i.e. rk(Ω) = n.

(iii) The analysis operator Ω does not have linearly dependent rows, i.e. ω_{i,:} ≠ ±ω_{j,:} for i ≠ j.

The rank condition (ii) on Ω is motivated by the fact that different input samples s₁ ≠ s₂ should be mapped to different analyzed vectors Ωs₁ ≠ Ωs₂. With Condition (iii), redundant transform coefficients in an analyzed vector are avoided.

These constraints motivate the consideration of the set of full rank matrices with normalized columns, which admits a manifold structure known as the oblique manifold [28]

 OB(n,k) := {X ∈ ℝ^{n×k} | rk(X) = n, ddiag(X⊤X) = I_k}. (11)

Note that this definition only yields a non-empty set if n ≤ k, which is the interesting case in this work. Thus, from now on, we assume n ≤ k. Remember that we require the rows of Ω to have unit Euclidean norm. Hence, we restrict the transpose of the learned analysis operator, X = Ω⊤, to be an element of OB(n,k).

Since the set of full rank matrices is open and dense in the set of matrices with normalized columns, we need a penalty function that ensures the rank constraint (ii) and prevents the iterates from approaching the boundary of OB(n,k).

Lemma 1

The inequality 0 < det((1/k) XX⊤) ≤ (1/n)ⁿ holds true for all X ∈ OB(n,k), where λ_i denote the eigenvalues of (1/k) XX⊤.

Proof:

Due to the full rank condition on X, the product XX⊤ is positive definite; consequently, the strict inequality 0 < det((1/k) XX⊤) applies. To see the second inequality of Lemma 1, observe that

 ∥X∥2F=tr(XX⊤)=k, (12)

which implies tr((1/k) XX⊤) = 1. Since the trace of a matrix is equal to the sum of its eigenvalues, which are strictly positive in our case, it follows that the strict inequality λ_i < 1 holds true for all eigenvalues λ_i of (1/k) XX⊤. From the well known relation between the arithmetic and the geometric mean we see

 (∏_i λ_i)^{1/n} ≤ (1/n) ∑_i λ_i. (13)

Now, since the determinant of a matrix is equal to the product of its eigenvalues, and with ∑_i λ_i = 1, we have

 det((1/k) XX⊤) = ∏_i λ_i ≤ (1/n)ⁿ, (14)

which completes the proof.
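Lemma 1 can be checked numerically for a random point on OB(n,k); for a Gaussian matrix with normalized columns, full rank holds almost surely:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 10

# Random point on OB(n, k): full-rank matrix with unit-norm columns
X = rng.standard_normal((n, k))
X /= np.linalg.norm(X, axis=0)

d = np.linalg.det(X @ X.T / k)
print(d)   # Lemma 1 asserts 0 < det((1/k) X X^T) <= (1/n)^n
```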

Recalling that X = Ω⊤ and considering Lemma 1, we can enforce the full rank constraint with the penalty function

 h(Ω) := −(1/(n log n)) log det((1/k) Ω⊤Ω). (15)

Regarding Condition (iii), the following result proves useful.

Lemma 2

For a matrix X ∈ OB(n,k), the inequality |x_{:,i}⊤ x_{:,j}| ≤ 1 applies, where equality holds true if and only if x_{:,i} = ±x_{:,j}.

Proof:

By the definition of OB(n,k), the columns of X are normalized; consequently, Lemma 2 follows directly from the Cauchy-Schwarz inequality.

Thus, Condition (iii) can be enforced via the logarithmic barrier function of the scalar products between all distinct rows of Ω, i.e.

 r(Ω) := −∑_{1≤i<j≤k} log(1 − (ω_{i,:} ω_{j,:}⊤)²). (16)

Finally, combining all the introduced constraints, our optimization problem for learning the transposed analysis operator reads as

 Ω⊤ = arg min_{X∈OB(n,k)} J_p(X⊤S) + κ h(X⊤) + μ r(X⊤). (17)

Therein, the two weighting factors κ, μ ∈ ℝ₊ control the influence of the two constraints on the final solution. The following lemma clarifies the role of κ.

Lemma 3

Let Ω⊤ be a minimum of h in the set of transposed oblique matrices, i.e.

 Ω⊤∈arg minX∈OB(n,k)h(X⊤), (18)

then the condition number of Ω is equal to one.

Proof:

It is well known that equality of the arithmetic and the geometric mean in Equation (13) holds true if and only if all eigenvalues are equal, i.e. λ_i = 1/n. Hence, if Ω⊤ minimizes h, then (1/k) Ω⊤Ω = (1/n) I_n, and consequently all singular values of Ω coincide. This implies that the condition number of Ω, which is defined as the quotient of the largest to the smallest singular value, is equal to one.

In other words, the minima of h are uniformly normalized tight frames, cf. [25, 26]. From Lemma 3 we can conclude that with larger κ the condition number of Ω approaches one. Now, recall the inequality

 σ_min ∥s₁ − s₂∥₂ ≤ ∥Ω(s₁ − s₂)∥₂ ≤ σ_max ∥s₁ − s₂∥₂, (19)

with σ_min being the smallest and σ_max being the largest singular value of Ω. From this it follows that an analysis operator found with a large κ, i.e. obeying σ_min ≈ σ_max, carries over the distinctness of different signals to their analyzed versions. The parameter μ regulates the redundancy between the rows of the analysis operator and consequently avoids redundant coefficients in the analyzed vector Ωs.

Lemma 4

The difference between any two entries of the analyzed vector is bounded by

 |ω_{i,:}⊤ s − ω_{j,:}⊤ s| ≤ √(2(1 − ω_{i,:}⊤ ω_{j,:})) ∥s∥₂. (20)

Proof:

From the Cauchy-Schwarz inequality we get

 |ω⊤i,:s−ω⊤j,:s|=|(ωi,:−ωj,:)⊤s|≤∥ωi,:−ωj,:∥2∥s∥2. (21)

Since by definition ∥ω_{i,:}∥₂ = 1, it follows that ∥ω_{i,:} − ω_{j,:}∥₂ = √(2(1 − ω_{i,:}⊤ ω_{j,:})).

The above lemma implies that if the i-th entry of the analyzed vector is significantly larger than zero, then a large absolute value of ω_{i,:}⊤ω_{j,:} prevents the j-th entry from being small. To achieve large cosparsity, this is an unwanted effect that our approach avoids via the log-barrier function in (16). It is worth mentioning that the same effect is achieved by minimizing the analysis operator's mutual coherence, and our experiments suggest that enlarging μ leads to minimizing the mutual coherence.

In the next section, we explain how the manifold structure of OB(n,k) can be exploited to efficiently learn the analysis operator.

III Analysis Operator Learning Algorithm

Knowing that the feasible set of solutions to Problem (17) is restricted to a smooth manifold allows us to formulate a geometric conjugate gradient (CG-) method to learn the analysis operator. Geometric CG-methods have been proven efficient in various applications, due to the combination of moderate computational complexity and good convergence properties, see e.g. [29] for a CG-type method on the oblique manifold.

To make this work self-contained, we start by briefly reviewing the general concepts of optimization on matrix manifolds. After that, we present the concrete formulas and implementation details for our optimization problem on the oblique manifold. For an in-depth introduction to optimization on matrix manifolds, we refer the interested reader to [30].

III-A Optimization on Matrix Manifolds

Let M be a smooth Riemannian submanifold of ℝ^{n×k} equipped with the standard Frobenius inner product ⟨X, Y⟩ := tr(X⊤Y), and let f: M → ℝ be a differentiable cost function. We consider the problem of finding

 arg minX∈Mf(X). (22)

The concepts presented in this subsection are visualized in Figure 2 to ease understanding.

To every point X ∈ M one can assign a tangent space T_X M. The tangent space at X is a real vector space containing all possible directions that tangentially pass through X. An element Ξ ∈ T_X M is called a tangent vector at X. Each tangent space is associated with an inner product inherited from the surrounding ℝ^{n×k}, which allows one to measure distances and angles on M.

The Riemannian gradient G(X) of f at X is an element of the tangent space T_X M that points in the direction of steepest ascent of the cost function on the manifold. As we require M to be a submanifold of ℝ^{n×k}, and since by assumption f is defined on the whole ℝ^{n×k}, the Riemannian gradient is simply the orthogonal projection of the (standard) gradient ∇f(X) onto the tangent space T_X M. In formulas, this reads as

 G(X):=ΠTXM(∇f(X)). (23)

A geodesic Γ(X, Ξ, t) is a smooth curve emanating from X in the direction of Ξ ∈ T_X M, which locally describes the shortest path between two points on M. Intuitively, it can be interpreted as the equivalent of a straight line in the manifold setting.

Conventional line search methods search for the next iterate along a straight line. This is generalized to the manifold setting as follows. Given a current optimal point X^(i) and a search direction H^(i) ∈ T_{X^(i)}M at the i-th iteration, the step size α^(i) which leads to sufficient decrease of f can be determined by finding the minimizer of

 α^(i) = arg min_{t≥0} f(Γ(X^(i), H^(i), t)). (24)

Once α^(i) has been determined, the new iterate is computed by

 X(i+1)=Γ(X(i),H(i),α(i)). (25)

Now, one straightforward approach to minimize f is to alternate Equations (23), (24), and (25) using H^(i) = −G^(i), with the shorthand notation G^(i) := G(X^(i)), which corresponds to steepest descent on a Riemannian manifold. However, as in standard optimization, steepest descent only has a linear rate of convergence. Therefore, we employ a conjugate gradient method on a manifold, as it offers a superlinear rate of convergence while still being applicable to large scale optimization problems with low computational complexity.

In CG-methods, the updated search direction H^(i+1) is a linear combination of the gradient G^(i+1) and the previous search direction H^(i). Since adding vectors that belong to different tangent spaces is not defined, we need to map H^(i) from T_{X^(i)}M to T_{X^(i+1)}M. This is done by the so-called parallel transport T(Ξ, X^(i), H^(i), α^(i)), which transports a tangent vector Ξ ∈ T_{X^(i)}M along the geodesic Γ(X^(i), H^(i), t) to the tangent space T_{X^(i+1)}M. Now, using the shorthand notation

 T(i+1)Ξ:=T(Ξ,X(i),H(i),α(i)), (26)

the new search direction is computed by

 H(i+1)=−G(i+1)+β(i)T(i+1)H(i), (27)

where β^(i) ∈ ℝ is calculated by some update formula adapted to the manifold setting. Most popular are the update formulas by Fletcher-Reeves (FR), Hestenes-Stiefel (HS), and Dai-Yuan (DY). With Y^(i+1) := G^(i+1) − T^(i+1)G^(i), they read as

 β^(i)_FR = ⟨G^(i+1), G^(i+1)⟩ / ⟨G^(i), G^(i)⟩, (28)
 β^(i)_HS = ⟨G^(i+1), Y^(i+1)⟩ / ⟨T^(i+1)H^(i), Y^(i+1)⟩, (29)
 β^(i)_DY = ⟨G^(i+1), G^(i+1)⟩ / ⟨T^(i+1)H^(i), Y^(i+1)⟩. (30)

Now, a solution to Problem (22) is computed by alternating between finding the search direction on and updating the current optimal point until some user-specified convergence criterion is met, or a maximum number of iterations has been reached.

III-B Geometric Conjugate Gradient for Analysis Operator Learning

In this subsection we derive all ingredients to implement the geometric conjugate gradient method as described in the previous subsection for the task of learning the analysis operator. Results regarding the geometry of OB(n,k) are derived e.g. in [30]. To enhance legibility, and since the dimensions n and k are fixed throughout the rest of the paper, the oblique manifold is further on denoted by OB.

The tangent space at X ∈ OB is given by

 TXOB={Ξ∈Rn×k|ddiag(X⊤Ξ)=0}. (31)

The orthogonal projection of a matrix Q ∈ ℝ^{n×k} onto the tangent space T_X OB is

 ΠTXOB(Q)=Q−Xddiag(X⊤Q). (32)

Regarding geodesics, note that in general a geodesic is the solution of a second order ordinary differential equation, meaning that for arbitrary manifolds, its computation, as well as computing the parallel transport, is not feasible. Fortunately, as the oblique manifold is a Riemannian submanifold of a product of k unit spheres S^{n−1}, the formulas for parallel transport and the exponential mapping allow an efficient implementation.

Let x ∈ S^{n−1} be a point on the sphere and h ∈ T_x S^{n−1} be a tangent vector at x; then the geodesic in the direction of h is a great circle

 γ(x, h, t) = { x, if ∥h∥₂ = 0; x cos(t∥h∥₂) + (h/∥h∥₂) sin(t∥h∥₂), otherwise. (33)

The associated parallel transport of a tangent vector ξ ∈ T_x S^{n−1} along the great circle reads as

 τ(ξ, x, h, t) = ξ − (ξ⊤h/∥h∥₂²) ( x ∥h∥₂ sin(t∥h∥₂) + h (1 − cos(t∥h∥₂)) ). (34)

As OB is a submanifold of the product of k unit spheres, the geodesic through X ∈ OB in the direction of H ∈ T_X OB is simply the combination of the great circles emerging by concatenating each column of X with the corresponding column of H, i.e.

 Γ(X,H,t)=[γ(x:,1,h:,1,t),…,γ(x:,k,h:,k,t)]. (35)

Accordingly, the parallel transport of along the geodesic is given by

 T(Ξ,X,H,t)=[τ(ξ:,1,x:,1,h:,1,t),…,τ(ξ:,k,x:,k,h:,k,t)]. (36)
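Equations (32), (33), and (35) can be implemented column-wise in a few lines. The sketch below assumes NumPy and verifies that a geodesic step keeps the columns on the unit spheres:

```python
import numpy as np

def proj_tangent(X, Q):
    """Orthogonal projection onto T_X OB, Equation (32): Q - X ddiag(X^T Q)."""
    return Q - X * np.sum(X * Q, axis=0)

def geodesic(X, H, t):
    """Column-wise great circles, Equations (33) and (35)."""
    nh = np.linalg.norm(H, axis=0)
    G = X.copy()
    mask = nh > 0
    G[:, mask] = (X[:, mask] * np.cos(t * nh[mask])
                  + H[:, mask] / nh[mask] * np.sin(t * nh[mask]))
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 6))
X /= np.linalg.norm(X, axis=0)                  # point on OB(3, 6)
H = proj_tangent(X, rng.standard_normal((3, 6)))

Y = geodesic(X, H, 0.5)
print(np.linalg.norm(Y, axis=0))   # columns stay unit-norm on the manifold
```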

Now, to use the geometric CG-method for learning the analysis operator, we require a differentiable cost function. Since the cost function presented in Problem (17) is not differentiable due to the non-smoothness of the ℓ_p-pseudo-norm (10), we replace Function (10) with a smooth approximation, which is given by

 J_{p,ν}(V) := (1/2M) ∑_{j=1}^{M} ( (1/p) ∑_{i=1}^{k} (v_ij² + ν)^{p/2} )², (37)

with ν ∈ ℝ₊ being the smoothing parameter. The smaller ν is, the more closely the approximation resembles the original function. Again, taking V = ΩS and with the shorthand notation z_ij := (ΩS)_ij, the gradient of the applied sparsity promoting function (37) reads as

 ∂/∂Ω J_{p,ν}(ΩS) = [ (1/M) ∑_{j=1}^{M} ( (1/p) ∑_{i=1}^{k} (z_ij² + ν)^{p/2} ) ∑_{i=1}^{k} z_ij (z_ij² + ν)^{p/2−1} E_ij ] S⊤. (38)
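The smoothed measure (37) and its Ω-gradient can be sketched and validated against finite differences. The gradient below is derived via the chain rule rather than transcribed from the paper, and the parameter values are illustrative only:

```python
import numpy as np

def J(Omega, S, p=0.7, nu=1e-4):
    """Smoothed sparsity measure of Equation (37), evaluated at V = Omega S."""
    V = Omega @ S
    M = S.shape[1]
    col = np.sum((V**2 + nu) ** (p / 2), axis=0) / p   # per-column inner sum
    return np.sum(col**2) / (2 * M)

def J_grad(Omega, S, p=0.7, nu=1e-4):
    """Gradient of J w.r.t. Omega via the chain rule (cf. Equation (38))."""
    V = Omega @ S
    M = S.shape[1]
    col = np.sum((V**2 + nu) ** (p / 2), axis=0) / p
    dV = (col / M) * V * (V**2 + nu) ** (p / 2 - 1)    # dJ/dV, column-wise
    return dV @ S.T

# finite-difference check of one entry of the gradient
rng = np.random.default_rng(0)
Omega, S = rng.standard_normal((4, 3)), rng.standard_normal((3, 5))
eps = 1e-6
E = np.zeros_like(Omega)
E[1, 2] = 1.0
fd = (J(Omega + eps * E, S) - J(Omega - eps * E, S)) / (2 * eps)
print(fd, J_grad(Omega, S)[1, 2])   # the two values agree
```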

The gradient of the rank penalty term (15) is

 ∂/∂Ω h(Ω) = −(2/(kn log n)) Ω ((1/k) Ω⊤Ω)^{−1}, (39)

and the gradient of the logarithmic barrier function (16) is

 ∂/∂Ω r(Ω) = 2 C̃ Ω, where c̃_ij = (ω_{i,:} ω_{j,:}⊤)/(1 − (ω_{i,:} ω_{j,:}⊤)²) for i ≠ j and c̃_ii = 0. (40)

Combining Equations (38), (39), and (40), the gradient of the cost function

 f(X):=Jp,ν(X⊤S)+κ h(X⊤)+μ r(X⊤) (41)

which is used for learning the analysis operator reads as

 ∇f(X)=∂∂XJp,ν(X⊤S)+κ∂∂Xh(X⊤)+μ∂∂Xr(X⊤). (42)

Regarding the CG-update parameter β, we employ a hybridization of the Hestenes-Stiefel Formula (29) and the Dai-Yuan Formula (30)

 β^(i)_hyb = max(0, min(β^(i)_DY, β^(i)_HS)), (43)

which has been suggested in [31]. As explained therein, formula (43) combines the good numerical performance of HS with the desirable global convergence properties of DY.

Finally, to compute the step size α^(i), we use an adaption of the well-known backtracking line search to the geodesic Γ(X^(i), H^(i), t). In this scheme, an initial step size t₀ is iteratively decreased by a constant factor until the Armijo condition is met; see Algorithm 1 for the entire procedure.

In our implementation, the shrinking factor and the Armijo constant were chosen empirically. As an initial guess for the step size at the first CG-iteration, we choose

 t₀^(0) = ∥G^(0)∥_F^{−1}, (44)

as proposed in [32]. In the subsequent iterations, the backtracking line search is initialized by the previous step size divided by the line search parameter. Our complete approach for learning the analysis operator is summarized in Algorithm 2. Note that, provided the Fletcher-Reeves update formula is used and some mild conditions on the step-size selection hold, the convergence of Algorithm 2 to a critical point, i.e. G(X) = 0, is guaranteed by a result provided in [33].
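The backtracking step of Algorithm 1 can be sketched as follows; the parameter values (initial step t0, shrinking factor c, Armijo constant rho) are placeholders, since the paper's empirical choices are not reproduced here:

```python
import numpy as np

def backtrack(f, Gamma, X, H, G, t0=1.0, c=0.5, rho=1e-4, max_iter=50):
    """Armijo backtracking along the geodesic Gamma(X, H, t): shrink t by the
    factor c until f decreases by at least rho * t * <G, H>."""
    fx = f(X)
    slope = np.sum(G * H)          # <G, H>; negative for a descent direction
    t = t0
    for _ in range(max_iter):
        if f(Gamma(X, H, t)) <= fx + rho * t * slope:
            break
        t *= c
    return t

# toy usage on the unit sphere: f(x) = x^T A x, geodesics are great circles
A = np.diag([3.0, 1.0])
f = lambda x: x @ A @ x

def Gamma(x, h, t):
    nh = np.linalg.norm(h)
    return x if nh == 0 else x * np.cos(t * nh) + h / nh * np.sin(t * nh)

x = np.array([1.0, 1.0]) / np.sqrt(2.0)
egrad = 2 * A @ x
G = egrad - x * (x @ egrad)        # Riemannian gradient, Eq. (23)
t = backtrack(f, Gamma, x, -G, G)
print(t, f(Gamma(x, -G, t)) < f(x))
```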

IV Analysis Operator based Image Reconstruction

In this section, we explain how the learned analysis operator is utilized for reconstructing an unknown image from some measurements following the analysis approach (5). Here, the vector s denotes a vectorized image of dimension N = wh, with w being the width and h being the height of the image, obtained by stacking the columns of the image above each other. In the following, we will loosely speak of s as the image.

Remember that the size of the analysis operator Ω is very small compared to the size of the image, so it has to be applied locally to small image patches rather than globally to the entire image. Artifacts that arise from naïve patch-wise reconstruction are commonly reduced by considering overlapping patches. Thereby, each patch is reconstructed individually, and the entire image is formed by averaging over the overlapping regions in a final step. However, this method lacks global support during the reconstruction process; hence, it leads to poor inpainting results and is not applicable to e.g. Compressive Sensing tasks. To overcome these drawbacks, we use a method related to the patch-based synthesis approach from [12] and the method used in [27], which provides global support from local information. Instead of optimizing over each patch individually and combining the results in a final step, we optimize over the entire image, demanding that each pixel is reconstructed such that the average sparsity of all patches it belongs to is minimized. When all possible patch positions are taken into account, this procedure is entirely partitioning-invariant. For legibility, we assume square patches whose side length is a positive integer.

Formally, let R and C denote sets of row and column indices, respectively, which determine the degree of overlap between two adjacent patches in the vertical and the horizontal direction. We consider all image patches whose center is an element of the Cartesian product set R × C. Hence, with |·| denoting the cardinality of a set, the total number of patches being considered is equal to |R||C|. Now, let P_rc be a binary matrix that extracts the patch centered at position (r, c). With this notation, we formulate the (global) sparsity promoting function as

 ∑_{r∈R} ∑_{c∈C} ∑_{i=1}^{k} ((Ω P_rc s)_i² + ν)^{p/2}, (45)

which measures the overall approximated ℓ_p-pseudo-norm of the considered analyzed image patches. We compactly rewrite Equation (45) as

 g(Ω_F s) := ∑_{i=1}^{K} ((Ω_F s)_i² + ν)^{p/2}, (46)

with K = k|R||C| and

 Ω_F := [(Ω P_{r₁c₁})⊤, (Ω P_{r₁c₂})⊤, …, (Ω P_{r_|R| c_|C|})⊤]⊤ (47)

being the global analysis operator that expands the patch based one to the entire image. We treat image boundary effects by employing constant padding, i.e. replicating the values at the image boundaries as often as required. Certainly, for image processing applications Ω_F is too large to be applied in terms of a matrix vector multiplication. Fortunately, applying Ω_F and its transpose can be implemented efficiently using sliding window techniques, and the matrix vector notation is solely used for legibility.
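A minimal sketch of such a sliding window implementation is given below, assuming stride-1 patch extraction, column-wise vectorization, and no boundary padding; the function name and interface are our own illustrative choices.

```python
import numpy as np

def apply_global_operator(Omega, img, n):
    """Apply the global analysis operator patch-wise instead of building
    the huge matrix Omega_F: slide an n x n window over the image and
    analyze each vectorized patch with the small operator Omega.

    Returns a (num_patches, k) array of analysis coefficients, where
    k is the number of rows of Omega.
    """
    h, w = img.shape
    k = Omega.shape[0]
    rows, cols = h - n + 1, w - n + 1
    out = np.empty((rows * cols, k))
    idx = 0
    for r in range(rows):
        for c in range(cols):
            # column stacking of the patch, matching the vectorization of s
            patch = img[r:r + n, c:c + n].reshape(-1, order='F')
            out[idx] = Omega @ patch
            idx += 1
    return out
```

In practice, each row of Ω can equivalently be applied to the whole image as a 2-D correlation, which avoids the explicit Python loops.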

Following [34], we exploit the fact that the range of pixel intensities is limited by a lower bound b_l and an upper bound b_u. We enforce this bounding constraint by adding to the cost function the differentiable penalty term b(s), given as

 b(s) = { |s − b_u|² if s ≥ b_u;  |s − b_l|² if s ≤ b_l;  0 otherwise. (48)
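A direct element-wise implementation of the penalty (48), summed over all pixels, might look as follows; the function name and the default 8-bit bounds are illustrative assumptions.

```python
import numpy as np

def box_penalty(s, bl=0.0, bu=255.0):
    """Differentiable penalty pushing pixel intensities into [bl, bu]:
    quadratic outside the box, zero inside, cf. Equation (48).
    The defaults bl=0, bu=255 assume an 8-bit intensity range."""
    over = np.maximum(s - bu, 0.0)   # violation above the upper bound
    under = np.maximum(bl - s, 0.0)  # violation below the lower bound
    return np.sum(over ** 2 + under ** 2)
```

Because the penalty is built from squared hinge terms, it is continuously differentiable, which keeps the overall objective amenable to gradient-based solvers.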

Finally, combining the two constraints (46) and (48) with the data fidelity term, the analysis based image reconstruction problem is to solve

 s⋆ = argmin_{s∈ℝ^N} ½‖As − y‖₂² + b(s) + λ g(Ω_F s). (49)

Therein, λ balances between the sparsity of the solution's analysis coefficients and the solution's fidelity to the measurements. The measurement matrix A and the measurements y are application dependent.
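For illustration, the cost in (49) can be evaluated as sketched below. The interface is our own: an `analyze` callback stands in for the application of Ω_F, `penalty` for b(·), and the defaults for ν and p are assumed values, not those of the paper.

```python
import numpy as np

def objective(s, A, y, lam, analyze, penalty, nu=1e-6, p=0.5):
    """Evaluate the reconstruction cost of Equation (49):
    0.5*||As - y||^2 + b(s) + lambda * g(Omega_F s).

    analyze(s) should return the stacked analysis coefficients Omega_F s.
    """
    data = 0.5 * np.sum((A @ s - y) ** 2)        # data fidelity term
    coeffs = analyze(s)                          # Omega_F s
    sparsity = np.sum((coeffs ** 2 + nu) ** (p / 2.0))  # smoothed l_p term, cf. (46)
    return data + penalty(s) + lam * sparsity
```

A gradient-based solver such as the conjugate gradient method referenced in the text would repeatedly evaluate this function and its gradient.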

V Evaluation and Experiments

The first part of this section aims at answering the question of what constitutes a good analysis operator for solving image reconstruction problems, and relates the quality of an analysis operator to its mutual coherence and its condition number. This, in turn, allows us to select the optimal weighting parameters κ and μ for GOAL. Using these parameters, we learn one general analysis operator by GOAL and compare its image denoising performance with that of other analysis approaches. In the second part, we employ this unaltered operator for solving the two classical image reconstruction tasks of image inpainting and single image super-resolution, and compare our results with the currently best analysis approach FoE [27] and with state-of-the-art methods specifically designed for each respective application.

V-A Global Parameters and Image Reconstruction

To quantify the reconstruction quality, we use, as usual, the peak signal-to-noise ratio (PSNR). Moreover, we measure the quality using the Mean Structural SIMilarity Index (MSSIM) [35], with the same set of parameters as originally suggested in [35]. Compared to the PSNR, the MSSIM better reflects a human observer's visual impression of quality. It ranges between zero and one, with one meaning perfect image reconstruction.
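For reference, the PSNR can be computed as in the following sketch; the 8-bit peak value is an assumption, and the MSSIM computation of [35] is not reproduced here.

```python
import numpy as np

def psnr(x, ref, peak=255.0):
    """Peak signal-to-noise ratio in dB between an image x and a
    reference ref; peak=255 assumes an 8-bit intensity range."""
    mse = np.mean((np.asarray(x, dtype=float) - np.asarray(ref, dtype=float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

Higher values indicate a smaller mean squared deviation from the ground truth.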

Throughout all experiments, we fixed the size of the image patches to a single common value. This is in accordance with the patch sizes mostly used in the literature, and yields a good trade-off between reconstruction quality and numerical burden. Images are reconstructed by solving the minimization problem (49) via the conjugate gradient method proposed in [34]. For the pixel intensity bounds b_l and b_u, we used the common intensity range of grayscale image formats. The sparsity promoting function (46), with fixed values of p and ν, is used both for learning the analysis operator by GOAL and for reconstructing the images. Our patch based reconstruction algorithm as explained in Section IV achieves the best results for the maximum possible patch overlap. The Lagrange multiplier λ and the measurement matrix A depend on the application and are briefly discussed in the respective subsections.

V-B Analysis Operator Evaluation and Parameter Selection

For evaluating the quality of an analysis operator and for selecting appropriate parameters for GOAL, we choose image denoising as the baseline experiment. The images to be reconstructed have been artificially corrupted by additive white Gaussian noise (AWGN) of varying standard deviation. This baseline experiment is further used to compare GOAL with other analysis operator learning methods. We would like to emphasize that the choice of image denoising as the baseline experiment is crucial neither for selecting the learning parameters nor for ranking the learning approaches. In fact, any other reconstruction task as discussed below leads to the same parameters and the same ranking of the different learning algorithms.

For image denoising, the measurement matrix A in Equation (49) is the identity matrix. As is common in the denoising literature, we assume the noise level to be known and adjust λ accordingly; from our experiments, we determined a suitable choice empirically. We terminate our algorithm after a number of iterations depending on the noise level, i.e. the higher the noise level, the more iterations are required. To find an optimal analysis operator, we learned several operators with varying values for κ and μ as well as for the degree of over-completeness, and fixed all other parameters according to Subsection V-A. Then, we evaluated their performance on the baseline task, which consists of denoising the five test images, each corrupted with the five noise levels as given in Table I. As the final performance measure, we use the average PSNR of the 25 achieved results. The training set consisted of image patches, each normalized to unit Euclidean norm, that have been randomly extracted from the five training images shown in Figure 3. Certainly, these images are not considered within any of the performance evaluations. Each time, we initialized GOAL with a random matrix having normalized rows. Tests with other initializations, like an overcomplete DCT, did not influence the final operator.

The results showed that our approach clearly benefits from over-completeness: the larger the number of atoms, the better the operator performs, with saturation setting in at a certain degree of over-completeness. Therefore, we fixed the number of atoms accordingly for all further experiments. Regarding κ and μ, note that by Lemmas 3 and 4 these parameters influence the condition number and the mutual coherence of the learned operator. Towards answering the question of what is a good condition number and mutual coherence for an analysis operator, Figure 4 shows the relative denoising performance of 400 operators learned by GOAL in relation to their mutual coherence and condition number. We would like to mention that, according to our experiments, this relation is largely independent of the degree of over-completeness. It is also interesting to notice that the best learned analysis operator is not a uniformly tight frame. The best performing analysis operator in our experiments is obtained for one concrete choice of κ and μ; its singular values are shown in Figure 4 and its atoms are visualized in Figure 5. This operator remains unaltered throughout all following experiments in Subsections V-C to V-E.

V-C Comparison with Related Approaches

The purpose of this subsection is to rank our approach among other analysis operator learning methods, and to compare its performance with state-of-the-art denoising algorithms. Concretely, we compare the denoising performance of the operator learned by GOAL with total variation (TV) [36], which is the best known analysis operator, with the recently proposed method AOL [25], and with the currently best performing analysis operator FoE [27]. Note that we used the same training set and dimensions for learning the operator by AOL as for GOAL. For FoE, we used the same setup as originally suggested by the authors. Concerning the computation time required for learning an analysis operator in this setting, GOAL needs on the order of minutes on an Intel Core i7 3.2 GHz quad-core with 8 GB RAM. In contrast, AOL is approximately ten times slower, and FoE is the computationally most expensive method, requiring several hours. All three methods are implemented in unoptimized Matlab code.

The achieved results for the five test images and the five noise levels are given in Table I. Our approach achieves the best results among the analysis methods regarding both PSNR and MSSIM. For a visual assessment, Figure 6 exemplarily shows some denoising results achieved by the four analysis operators.

To judge the analysis methods' denoising performance globally, we additionally give the results achieved by the current state-of-the-art methods BM3D [37] and K-SVD denoising [12], which are specifically designed for the purpose of image denoising. In most cases our method performs slightly better than the K-SVD approach, especially for higher noise levels, and apart from the "barbara" image it is worse than BM3D only by a small margin in dB. This effect is due to the very special texture structure of the "barbara" image, which rarely occurs in natural images and is therefore smoothed by the learned operator.

V-D Image Inpainting

In image inpainting as originally proposed in [2], the goal is to fill in a set of damaged or disturbing pixels such that the resulting image is visually appealing. This is necessary for the restoration of damaged photographs, for removing disturbances caused by e.g. defective hardware, or for deleting unwanted objects. Typically, the positions of the pixels to be filled in are given a priori. In our formulation, this leads to a binary measurement matrix A with one row per known pixel, where each row contains exactly one entry equal to one, whose position corresponds to a pixel with known intensity. Hence, y reflects the available image information. Regarding λ, it can be chosen such that our method simultaneously inpaints missing pixels and denoises the remaining ones.
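The described selection matrix can be constructed as in the following sketch; the function name and interface are illustrative, and in practice the selection would be applied implicitly via a binary mask rather than an explicit matrix.

```python
import numpy as np

def inpainting_matrix(known_idx, n):
    """Binary selection matrix A for inpainting: one row per known pixel,
    with a single 1 at that pixel's position in the vectorized image,
    so that A @ s extracts the observed intensities y."""
    m = len(known_idx)
    A = np.zeros((m, n))
    A[np.arange(m), known_idx] = 1.0
    return A
```

Applying A to a vectorized image keeps only the pixels at the known positions.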

As an example of image inpainting, we artificially disturbed some ground-truth images by removing pixels randomly distributed over the entire image, as exemplarily shown in Figure 7. In that way, the reconstruction quality can be judged both visually and quantitatively. We assumed the data to be free of noise and selected λ empirically. In Figure 7, we show exemplary results for reconstructing the "lena" image from a fraction of all pixels using GOAL, FoE, and the recently proposed synthesis based method [38]. Table II gives a comparison for further images and further numbers of missing pixels. It can be seen that our method performs best independent of the configuration.

V-E Single Image Super-Resolution

In single image super-resolution (SR), the goal is to reconstruct a high resolution image from an observed low resolution image, where the latter is assumed to be a blurred and downsampled version of the former. Mathematically, this process can be formulated with a decimation operator and