Bilevel Integrative Optimization for Ill-posed Inverse Problems

07/06/2019 ∙ by Risheng Liu, et al. ∙ Dalian University of Technology The University of Hong Kong 1

Classical optimization techniques often formulate the feasibility of the problems as set, equality or inequality constraints. However, explicitly designing these constraints is indeed challenging for complex real-world applications and too strict constraints may even lead to intractable optimization problems. On the other hand, it is still hard to incorporate data-dependent information into conventional numerical iterations. To partially address the above limits and inspired by the leader-follower gaming perspective, this work first introduces a bilevel-type formulation to jointly investigate the feasibility and optimality of nonconvex and nonsmooth optimization problems. Then we develop an algorithmic framework to couple forward-backward proximal computations to optimize our established bilevel leader-follower model. We prove its convergence and estimate the convergence rate. Furthermore, a learning-based extension is developed, in which we establish an unrolling strategy to incorporate data-dependent network architectures into our iterations. Fortunately, it can be proved that by introducing some mild checking conditions, all our original convergence results can still be preserved for this learnable extension. As a nontrivial byproduct, we demonstrate how to apply this ensemble-like methodology to address different low-level vision tasks. Extensive experiments verify the theoretical results and show the advantages of our method against existing state-of-the-art approaches.



There are no comments yet.


page 5

page 6

page 7

page 8

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Inverse problems naturally occur in almost all the areas of science and engineering [1, 2]. This problem can be formulated using an analytical description

of the forward operator in some given vector spaces


. For example, typical tasks in computer vision that can be phrased as inverse problems include image deblurring, where

is the blur kernel, super-resolution, where

downsamples high-resolution images, or image inpainting, where

is given by a mask corresponding to the inpainting domain. In medical imaging, common forward operators

are the Fourier transform in MRI and the ray transform in CT. In this way, the target of inverse problems is to obtain an unknown

from the given noisy observation . However, in most real-world scenarios, the above problem is always ill-posed, that is, the solution either not exists, is not unique or does not depend continuously on the datum . Therefore, the scope of this work is to investigate the so-called Ill-posed Inverse Problems (IIPs for short). In the past years, a variety of approaches have been proposed to address IIPs.

The first category of methods aim to find the maximum likelihood solution, resulted in the minimization of the data discrepancy [3, 4]. To penalize the unfeasible solutions, different variational regularizations have been developed to encode prior information about and they actually seek to minimize the regularized objective functionals [5]. These energy minimization approaches can be applied to solve various types of IIPs, but they often have poorer performance on an individual task. This is mainly because it is very hard to design correct priors to appropriately fit our desired solutions in real-world problems.

A more recent trend is to establish regression-type neural networks to learn a direct inverse mapping

from the observation space to the solution space [6, 7, 8]

. Such methods have achieved impressive performance in some practical applications. While the obvious disadvantages with these learned deep approaches are that their performances completely depend on rich high-quality training data and it is also challenging to learn a single network to address various different tasks. Theoretically, it is extremely hard to understand and interpret these complex network structures, due to their heuristic nature. Very recently, preliminary works have begun to link conventional approaches with task-specific computational modules (e.g., deep learning architectures) and developed a series of optimization-inspired learning methods to solve IIPs

[9, 10]. The idea is to unroll an existing optimization process and replace the explicit iterative updating rule with hand-designed operators and/or learned architectures [11]. Unfortunately, due to the uncontrolled inexact computational modules, it is hard to theoretically guarantee the convergence of these methods. The works in [10, 12, 13] tried to introduce error control rules to correct improper modules and thus prove the convergence of their trained iterations. However, these additional error checking process will slow down the particular computation when handling challenging problems. Moreover, deeper theoretical investigations (e.g., the stability of the iterations) are still missing.

1.1 Our Contributions

In this work, we propose a completely new paradigm, named Bilevel Integrative Optimization (BIO), to formulate and optimize IIPs in knowledge and data collaboration manner. Different from traditional variational modeling techniques, which only consider a single energy formulation, we introduce a specific bilevel system111Please notice that our BIO formulation is actually not a standard bilevel programming with coupled variables, but a specific case with one single variable [14].to combine principled energies, hand-designed priors and even deeply trained network architectures to formulate IIPs. Since in BIO we consider general convex composite (not necessarily smooth) energies in both upper and lower subproblems, to our best knowledge, no existing bilevel optimization algorithms can be used to address the resulted optimization problem. Fortunately, by re-characterizing the latent feasibility as explicit set constraints, an efficient iteration scheme is developed to address BIO. Theoretically, we can prove that our established iteration scheme can strictly converge to the global optimal solution of BIO. We also show that the computational errors during our practical iterations can be successfully dominated, thus verify the stability of the proposed method. Extensive experiments on real-world computer vision applications finally demonstrate the efficiency and effectiveness of the proposed framework. In summary, our contributions mainly include:

  • As a new modeling paradigm, BIO provides a flexible and modularized framework to integrate principled energies, hand-designed priors and deeply trained network architectures to formulate IIPs.

  • By investigating the stability of the nested lower-level subproblem, the validity of our latent feasibility re-characterization based single level reformulation scheme is strictly convinced.

  • Different from existing convex bilevel methods (e.g., [15, 16, 17]), which is only applicable for strongly convex upper-level subproblem, our solution scheme can successfully handle models with more general convex upper-level objectives.

2 Related Work

2.1 Existing Approaches for Inverse Problems

The most widely used approaches in solving IIPs are to minimize the mass-fit against data, i.e., where the generalized inverse is unbounded. Unfortunately, these purely analytic models are typically just an approximation to these real-world tasks. Moreover, given no presumption about the nature of the solution (e.g., the statistics of the image), it is virtually impossible to solve the IIPs, especially in real scenarios.

Over the past decades, one of the most widely used way to approach IIPs is by defining a likelihood and prior, and optimizing for the Maximum A Posteriori (MAP) solution. The sparsity of natural images is commonly utilized for defining the priors in some primitive works [3]. The nonlocal self-similarity is then developed for constructing the priors [18]. As further research, researchers tend to design complicated priors for the competitive performance [4, 5]. While more expressive priors allow for better representation of the signal of interest, they will typically make optimization more difficult. In fact, only for a few trivial prior-likelihood pairs will inference remain convex. In practice, one often has to resort to approximations of the objective and to approximate double-loop algorithms in order to allow for scalable inference.

The huge field of machine learning provides several data-driven approaches to tackle these problems by using training datasets to either construct a problem adapted forward operator and use an established inversion method or to solve the inverse problem directly 

[19, 20]. In particular, deep learning approaches using neural networks with multiple internal layers have become popular over the recent few years [21]. The development of deep learning lies in the massive data [22] and the significant network architecture [23]. Nowadays, the large scale of data has been seen everywhere. Consequently, the design of architecture plays a decisive role to push forward deep learning. Indeed, the big advancements of deep learning owing to some targeted design of architecture [24, 8]. There is no denying that the tremendous success of deep learning has achieved in many fields. But it is a vital shortcoming that the distribution of training set directly decide the performance in test data. Moreover, no consistent mathematical theory on deep neural networks for inverse problems has been developed yet besides the stunning experimental results, which have been published so far for many different types of applications to inverse problems.

Some recent works put more emphasis on building the connection between with traditional optimization and learning architecture. One is to design the complex priors with learning parameters and unrolling the iteration scheme to obtain the end-to-end learned architecture [25, 26]. The other is to substitute the subproblem with the data-driven network in the optimization process [27, 11]. To be brief, the key intentions of such methods lie in how to derive the learned architecture from the optimization model to incorporate the data prior. But unfortunately, the convergence behaviors and stability properties become difficult to analyze although the performances indeed bring the advancements, which loses the strengths of traditional optimization. There exists some works [10] have presented the convergence analysis by introducing the error control rules, which bring the extra computational burden and the theoretical results are not satisfied.

2.2 Bilevel Optimization Techniques

In general, bilevel models are hierarchical optimization problems, where the feasible region of the upper-level subproblem is restricted by the graph of the solution set mapping of the lower-level subproblem, see, e.g.,  [28]. Due to the coupling of the hierarchical subproblems, it is indeed extremely challenging to optimize classic bilevel problems. Recently, a series of works [15, 16, 17] have tried to address specific bilevel models, in which they minimize a convex function over the set of constrained minimizers of another convex function. Although upper-level variables exactly coincide with lower-level variables, the resulted convex bilevel problems are still extraordinarily ill-posed. The early work in [15] considered this category of models with both smooth upper and lower subproblems. Recent works in [16, 17] introduced minimal norm gradient and sequential averaging methods to address convex bilevel problems with strongly convex and smooth upper-level subproblem. Unfortunately, the problem settings in [15, 16, 17] are too restrictive for our tasks. Thus neither of these schemes is amenable as a solution method for the optimization problem in this work.

3 The Proposed Framework

In this section, we establish a generic convex bilevel framework to integrate different categories of information to formulate and optimize challenging inverse problems222We will demonstrate how to apply our framework to integrate both knowledge and data for real-world applications in Sec. 5..

3.1 Bilevel Integration Modeling

We first consider a composite optimization formulation:


where , are extended-valued convex functions and may be possibly nonsmooth. The model in Eq. (1

) has drawn increasing attention recently for the emerging applications in leaning and vision areas, e.g., face recognition 

[29], saliency detection [30], image restoration [10], optical flows [25] and many others [31]. While different from these existing approaches, which just formulate their objectives as specific forms of Eq. (1), here we would like to consider the solution set of Eq. (1) as the lower-level latent feasibility of our task and actually aim to solve the following bilevel problem with hierarchy


where we introduce another composite function as our upper-level objective. Similarly, we also assume that both , are also extended-valued convex functions and may be possibly nonsmooth. In this way, we actually obtain a convex bilevel optimization model, which implicitly integrate two different hierarchies of information (i.e., and ) for IIPs modeling.

3.2 A New Single Level Reformulation Strategy

From an optimization perspective, we actually utilize the lower-level subproblem Eq. (1) to characterize the feasible region of BIO in Eq. (2). The main benefit of such strategy is that we can take full advantage of our domain knowledge of the task. Indeed, Eq. (2) can be recognized as a specific convex bilevel model. However, both upper and lower levels are in general lack of smoothness and strong convexity. Thus the existing solution schemes [15, 16, 17] no longer admit any theoretical validity. In fact, when the upper-level subproblem is in the absence of strong convexity, directly solving Eq. (2) with such latent feasibility is extremely challenging.

Motivated by the observation that in applications, the latent feasibility in Eq. (1) usually possesses some underlying structures, in this paper, we develop a new optimization strategy with solid theoretical guarantees for structural BIO. In particular, we re-characterize the latent feasibility in Eq. (1). Therefore we shall reformulate Eq. (2) in terms of the re-characterization of Eq. (1) into a standard optimization problem which is numerically trackable. To this end, we first list some structural assumptions, which are necessary for our following analysis and easily to satisfy in applications of practical interests. Specifically, throughout this present paper, we suppose that takes the form that , where is some given linear operator and has the following structural properties333

Some commonly used functions in learning and vision (e.g., linear regression

, logistic regression

and likelihood estimation under Poisson noise ) automatically satisfy these assumptions, where , and are parameters.

Assumption 1.

Function is closed, proper, convex and admits the properties that (i) is continuously differentiable on , assumed to be open, and (ii) is strongly convex on any compact convex subset of .

We are now ready to state the following theorem to investigate the feasibility of our problem.

Theorem 1.

(latent feasibility re-characterization) Let be the solution set of Eq. (1) (i.e., ), then is invariant on . That is, given any , can be explicitly characterized as


Please notice that the proofs of all our theoretical results are presented in Appendix.

Following Theorem 1, we define and . Then the original bilevel model in Eq. (2) can be equivalently reformulated as the following single-level constrained optimization problem:


where denotes the indicator function of . Now the problem of solving BIO is reformulated to the single level standard optimization Eq. (4). In order to solve the single level reformulation, we introduce the following augmented Lagrangian function with auxiliary variables ,

where denote the dual multipliers and denotes the penalty parameter. Then the proximal ADMM (with ) reads as follows:


Thanks to Theorem 1, we directly have a corollary to guarantee the convergence of Eq. (5) toward the global optimal solutions of Eq. (2).

Corollary 1.

Suppose that the problem in Eq. (4) has KKT solutions. Let be the sequence generated by Eq. (5) on problem (4), then converges to the KKT point set of Eq. (4). In particular, the sequence converges to the global optimal solutions of Eq. (2).

Remark 1.

Indeed, we can further estimate a nice linear convergence rate of Eq. (5) for particular models. That is, if we have that takes the form that where is some given linear operator, satisfies Assumption 1, represents (1) convex polyhedral regularizer; (2) group-lasso regularizer; (3) sparse group lasso regularizer, is a convex polyhedral function, then converges linearly to the KKT point set of problem in Eq. (4). In particular, the sequence converges linearly to the global optimal solutions of Eq. (2).

4 Theoretical Investigations

With Theorem 1, we have that the solutions returned by solving Eq. (4) can exactly optimize the bilevel problem in Eq. (2). It can be seen that our single-level reformulation based optimization scheme heavily relies on the solution set re-characterization. In particular, we require one solution of the lower-level subproblem (i.e., ) to establish the feasible set .

However, in general, obtaining such exact solution is intractable. That is, we in practice can only calculate a solution with errors for the lower-level subproblem in Eq. (1), i.e., obtain a point satisfying , where is the distance mapping, measures the computational errors and denotes the solution set of the lower-level subproblem (defined in Eq. (3)). As a consequence, we actually should consider the practical optimization process of Eq. (2) as solving an approximation of Eq. (4), which can be formulated as follows444Please notice that Eq. (6) is only used for our theoretical analysis, but not practical computation.:


In the following, we shall analyze the convergence behaviors and stability properties of our practical computation (can be abstractly formulated as Eq. (6)) from the perturbation analysis perspective. Specifically, we consider the errors for solving as the perturbation of optimizing Eq. (4) and obtain the following constructive results:

  • Convergence (Theorem 2): As the error decreases to , the solution of Eq. (6) strictly converges to the solution set of the bilevel problem in Eq. (2).

  • Stability (Theorem 3): The proximity from the optimal solution of Eq. (6) to the solution set of bilevel problem in Eq. (2) can be strictly dominated in terms of .

4.1 Convergence Analysis

Before proving our formal convergence result, we first introduce some necessary notations. By respectively considering and in Eq. (6) as perturbed and , we are now aim to investigate the stability of the following parameterized optimization problem


where , and for any given . Moreover, we shall need the following notations.

  • The feasible set mapping of : .

  • The optimal value mapping of : .

  • The solution set mapping of : .

Continuity properties of set-valued mapping is developed in terms of outer and inner limits:

Definition 1.

A set-valued mapping is outer semicontinuous (OSC) at when and inner semicontinuous (ISC) at when It is called continuous at when it is both OSC and ISC at , as expressed by

Lemma 1.

Suppose that is a continuous function. If is continuous at and , then is outer semicontinuous at .

Remark 2.

We shall clarify the continuity assumption regarding in Lemma 1. In fact, when is a convex polyhedral function, is a closed polyhedral convex mapping. Then according to Theorem 3C.3 in [32], we know that is Lipschitz continuous, i.e. there exists such that for all ,

where for any nonempty sets and , is given by , and . Therefore, when is a convex polyhedral function, all the assumptions about in the lemma above are satisfied.

Now we are ready to induce the main result to guarantee the convergence of our proposed optimization scheme.

Theorem 2.

Suppose that is a convex polyhedral function and let be the solution returned by solving Eq. (6) with errors . If , then

  1. For any accumulation point of the sequence , we have that . That is, solves bilevel problem Eq. (2).

  2. If is coercive, then the sequence is bounded and hence admits at least one accumulation point.

4.2 Stability Analysis

Before we can establish the desired stability result for Eq. (6), we need the following general stability analysis as preliminaries.

Proposition 1.

Suppose that there exists neighborhood of some point such that where is Lipschitz continuous with modulus on , and there exist such that and Then for any , we have

According to the results obtained above, together with the arguments given in the proof of Theorem 2, we also have following stability guarantees as follows.

Theorem 3.

For given , let represents the solution returned by solving Eq. (6). Suppose that is a convex polyhedral function. If there exists a neighborhood of some point such that and is Lipschitz continuous on , then there exist such that for any , we have

(a) Relative Error (b) Reconstruction Error
Fig. 1: The iteration behaviors of BIO with different .
0.5430 51.33 0.9988
2.9989 25.37 0.6720
0.1478 1.2901 26.62 0.7797
1.7843 0.9153 34.82 0.9662
70.6342 0.8953 34.93 0.9577
131.0989 0.8339 34.96 0.9400
1.7843 2.1665 29.63 0.9029
1.7843 0.3729 36.18 0.9728
TABLE I: Quantitative results of BIO with different settings in Figs. 1-2. Time and Time are the CPU time (in seconds) for calculating and the iteration of BIO, respectively.
(a) Relative Error (b) Reconstruction Error
Fig. 2: The iteration behaviors of BIO with different .
Blurry Input Ground Truth APG: 31.45
FTVd: 32.36 Ours: 34.88 Ours: 35.75
Fig. 3: Comparisons of APG, FTVd, Ours (BIO with ) and Ours (BIO with ). The PSNR score is reported below each subfigure.
34.05/0.9326 31.76/0.9195 33.48/0.9360 33.52/0.9317
33.53/0.9317 34.20/0.9457 34.32/0.9459 34.82/0.9558
MLP IRCNN MSWNNM Ours Ground Truth
Fig. 4: Results of images corrupted by a kernel with the size of from Levin et al’ dataset [34]. The PSNR/SSIM scores are reported below each subfigure.

5 BIO for IIPs in Computer Vision

Indeed, a variety of computer vision tasks can be considered as IIPs. So in this section we demonstrate how to apply BIO to formulate and optimize typical computer vision problems (e.g., image restoration and compressive sensing MRI)555Please refer to Appendix for more algorithmic details of these applications..

Specifying Knowledge-driven Latent Feasibility. In particular, we first specify the lower-level subproblem by defining and in Eq. (2) to reveal the task information (i.e., fidelity) and the natural image distributions (i.e., priors), respectively. Here denotes the norm on the image gradient (a.k.a. total variation regularization [3]), which has been recognized as a proper and generic distribution assumption for the natural images. As for the task-specific fidelity, is actually used to guarantee that should be structurally similar to the observation after the degradation . So we define as a blur matrix for image restoration [5] or under-sampling matrix and Fourier transformation for compressive sensing MRI (i.e., ) [27]. In this way, we actually define a nested optimization task to introduce implicit feasibility to narrow down the solution space. We argue that compared with these explicitly defined constraints in standard optimization model, which can only reveal straightforward conditions, the knowledge-driven latent feasibility provided by BIO is definitely more flexible and powerful.

Investigating Data-driven Loss Function. With the given latent constraints, we are ready to introduce our upper-level subproblem in Eq. (2

). Indeed, we only slightly modify the standard loss function as

, where denotes a given mapping (e.g., pre-trained deep network architectures) on the observed image. Thus by learning on collected training set, we can successfully incorporate data information in BIO formulation666We state the details on the design and training of in Sec. 6. . Furthermore, thanks to the generic structure in Eq. (2), we can also introduce additional priors using to enforce specific constraints on . For example, in CS-MRI problem, we introduce a nonsmooth regularization with a transform (e.g., Discrete Cosine Transform (DCT) or Discrete Wavelet Transform (DWT), etc) to further induce sparsity of images in the transformed space [33].

FTVd [3] FISTA [38] HL [39] IDDBM3D [40] EPLL [41] MLP IRCNN [11] MSWNNM [5] BIO (Ours)
PSNR 29.38 31.81 30.12 31.53 31.65 31.32 32.28 32.50 32.79
SSIM 0.8819 0.9010 0.8961 0.9043 0.9258 0.8991 0.9200 0.9247 0.9318
TABLE II: Averaged image restoration quantitative performance (i.e., PSNR and SSIM) on Levin et al.’ benchmark.
32.71/0.9270 30.62/0.9263 33.27/0.9526 31.54/0.9429
32.80/0.9424 33.83/0.9555 33.85/0.9512 33.98/0.9582
MLP IRCNN MSWNNM Ours Ground Truth
Fig. 5: Results of images corrupted by a kernel with the size of from Levin et al’ dataset [34]. The PSNR/SSIM scores are reported below each subfigure.

6 Experimental Results

In this section, we first conduct numerical experiments to study the iteration and convergence behaviors of our proposed BIO. Then we compare the practical performance of BIO with state-of-the-art approaches (including deep learning methods) on two real-world IIP applications in low-level computer vision area, i.e., image restoration and compressed sensing MRI. All the experiments are conducted on a PC with Intel Core i7 CPU at 3.7GHz, 32GB RAM and a NVIDIA GeForce GTX 1080Ti 11GB GPU.

6.1 Model Evaluation and Verification

We first analyze and evaluate the important components of BIO on the image restoration task. As discussed in the above theoretical part, it is necessary to specify a particular to setup our algorithm. Here we consider different strategies to obtain it, i.e., set as the corrupted observation , or the numerical solutions i.e., , where is returned by solving the lower-level subproblem in Eq. (1) by APG [35] with different relative errors (i.e., is less than , or ). These three settings are denoted as and in Table I and Fig. 1. We also consider the optimal solution of the lower-level subproblem (i.e., )777Since the lower-level subproblem in Eq. (1) is convex and APG is also a strictly converged method, here we just compute iterations to numerically obtain the optimal solution . in these experiments as a reference. It can be seen from Fig. 1 and Table I (the top part) that the performance of BIO with is the upper bound of all these settings. We also observed that both and obtain very close results to the ideal strategy. But the speed of is much higher than . All these results actually have verified the nice convergence and stability properties of BIO. Thus in the following we always adopt to setup BIO for these applications.

30.31/0.8884 27.51/0.8487 29.48/0.8885 28.72/0.8886
29.89/0.9003 30.25/0.9078 30.36/0.9150 31.24/0.9206
MLP IRCNN MSWNNM Ours Ground Truth
Fig. 6: Results of images corrupted by a kernel with the size of from Levin et al’ dataset [34]. The PSNR/SSIM scores are reported below each subfigure.
Input HL: 21.18/0.4917 EPLL: 19.08/0.5708
IDDBM3D: 25.27/0.7809 MSWNNM: 25.81/0.8047 BIO (Ours): 26.54/0.8425
Fig. 7: Image restoration results on a challenging color image with a large-size kernel (). The PSNR/SSIM scores are reported below each subfigure. The PSNR/SSIM scores are reported below each subfigure.
Cartesian mask Gaussian mask Radial mask
Fig. 8: Averaged compressed sensing MRI results on IXI dataset. Top and Bottom row are and sampling rate, respectively. In each subfigure, the upper right is the best.
34.73/0.9267 36.82/0.9511 31.97/0.9009 32.45/0.9168 37.80/0.9583 38.35/0.9635
43.66/0.9792 44.12/0.9761 42.39/0.9776 46.31/0.9911 45.83/0.9832 46.88/0.9917
33.80/0.8891 34.85/0.9408 33.79/0.9320 35.02/0.9556 35.92/0.9504 36.04/0.9641
Fig. 9: Visual comparison of Compressive Sensing MRI at the sparse -space data with different undersampling patterns and 30% sampling rates (Top row: Cartesian mask. Middle row: Gaussian mask. Bottom row: Radial mask).
27.72/0.8155 26.84/0.7929 25.38/0.7488 26.26/0.7740 28.22/0.8260 28.78/0.8440
Fig. 10: Visual comparison of Cartesian mask at 30% sampling rate). The PSNR/SSIM scores are reported below each subfigure.

It is known that another important component of BIO is the operator in the upper-level subproblem. Here we also consider different strategies to design . Specifically, the most naive way to obtain this operator is just setting it as the identity mapping (i.e., , denoted as ). We can also define this operator by introducing a deconvolution loss and solving it in closed-form (denoted as ), i.e., with a trade-off (fix it as

for all the experiments). Inspired by the recent developments of deep learning, here we also introduce convolution neural networks (CNN) 

[36] as (denoted as

) in BIO. That is, we design a simple denoising CNN architecture, which consists of seven dilated convolutions, six ReLU operations (plugged between each two convolution layers), and five batch normalizations (plugged between convolution and ReLU, except the first convolution layer). We collected 800 natural images from ImageNet database 

[37] and add Gaussian noises to generate the training data for our CNN-based denoiser in all the experiments. We observed in Fig. 2 and Table I that the task-inspired operation is better than the naive strategy. While our learning-based module can significantly improve both the numerical performance and the final quantitative and qualitative results, which successfully verify the values of our knowledge and data integration paradigm in BIO.

From the optimization perspective, we can also understand that BIO actually provides a bilevel way to improve the optimization process of the lower-level model. To verify this effectiveness, we utilize two well-known numerical optimization methods, i.e., FTVd [3] and APG [35], to directly solve the lower-level model in Eq. (1). We compared their results and two versions of our BIO, including Ours(with the ) and Ours (with the ) in Fig. 3. It can be easily seen that both of our two BIO solvers obtained remarkably better performance than traditional numerical optimization approaches. Especially, the learning-based strategy is the best among all the compared methods, which successfully verify the power of our data-driven bilevel integration mechanism.

6.2 Applications in Computer Vision

Image Restoration. We now compare BIO with state-of-the-art image restoration approaches, including FTVd [3], FISTA [38], HL [39], IDDBM3D [40], EPLL [41], MLP [42], IRCNN [11] and MSWNNM [5]. We first conduct experiments on the well-known Levin et al.’ benchmark [34], which includes 32 images of the size and blurred by 8 different kernels of the size ranging from to . The visual comparisons on three example images from this benchmark are plotted in Fig. 4-6, it can be easily seen that our proposed algorithm achieves the prominent detail recovery with highest PSNR/SSIM scores in different cases. We also report the averaged quantitative scores (i.e., PSNR and SSIM) in Table II. We can see that the deep learning based IRCNN approach can achieve much better performance than traditional optimization based methods. The very recently developed MSWNNM adopted a multi-scale technique to investigate rich image details, thus its results are even better. While thanks to the novel interpretation of knowledge-driven and data-driven mechanisms, our BIO obtain the best quantitative and qualitative results among all the tested methods. In addition, a color image () corrupted by a very large kernel (of the size ) is also utilized to further evaluate these approaches, which is showed in Fig. 7. Obviously, all these compared state-of-the-art approaches tend to the unclear details depict. Again, BIO recovered richer textures and more details of the image, thus performed the best.

Compressive Sensing MRI (CS-MRI). We then evaluate BIO on the CS-MRI problem. Specifically, we conduct all the experiments on 55 images with the size of from the widely-used IXI MRI benchmark888 Our experiments contains three types of undersampling patterns (i.e., Cartesian, Gaussian, and Radial mask) and two sampling rates (i.e., , ) to generate the sparse -space data. We compared BIO with many existing state-of-the-art methods including TV [43], SIDWT [33], PBDW [44], PANO [45], FDLCP [46], ADMM-Net [27], BM3D-MRI [18], and TGDOF [47]. The scatter plots in Fig. 8 illustrated the quantitative results of all the compared approaches on the test data set, including three types of undersampling patterns and two kinds of sampling rates. It is easy to see that our BIO achieved the best results according to both PSNR and SSIM scores (i.e., the upper right red points) in all situations. We then consider three types of masks with 30% sampling rates to fully verify the performance. All these results are demonstrated in Fig. 9. Clearly, our BIO achieved outstanding performances with richer details and the highest quantitative scores in all these scenarios. Further, we generate challenging chest data using a Cartesian mask with a sampling rate of in Fig. 10. Consistently, BIO achieves the best qualitative and quantitative performance. The above sufficient experiments fully indicate the superiority of our BIO in terms of CS-MRI problem.

7 Conclusions

This paper studied a bilevel optimization paradigm to integrate knowledge and data for formalizing and optimizing ill-posed inverse problems. We theoretically proved our established solving scheme can strictly converge to the global optimal solution of BIO. Plenty of experiments on two real-world computer vision applications fully revealed the efficiency and effectiveness of the proposed framework.

Appendix A Proofs of Our Theoretical Results

In the following, we first provide detailed proofs for the propositions and theorems in our manuscript. The details for the bilevel optimization model and iteration schemes for particular applications are also summarized.

a.1 Proof of Theorem 1


Given a solution . First, for any , we have , thus


For any , if , then by the strong convexity of , there exists such that

And since , by the convexity of , we have

Combining the two inequalities given above, we obtain

which contradicts to the fact that . Next, since , and , we have , and thus


Upon combing Eq. (8) and Eq. (9), we reach the re-characterization of as Eq. (3). ∎

a.2 Proof of Corollary 1

Remark 3.

Indeed, we can further estimate a nice linear convergence rate of Eq. (5) for particular models. That is, if we have that takes the form that where is some given linear operator, satisfies Assumption 1, represents (1) convex polyhedral regularizer; (2) group-lasso regularizer; (3) sparse group lasso regularizer, is a convex polyhedral function, then converges linearly to the KKT point set of problem in Eq. (4). In particular, the sequence converges linearly to the global optimal solutions of Eq. (2).


By Theorem 1, we have the equivalence between Eq. (2) and Eq. (4). Thus the convergence of the iterations in Eq. (5) can be directly guaranteed via standard convergence results of proximal ADMM [48]. Moreover, by applying the investigations in [49], we can further obtain a linear convergence rate estimation for our iteration in Eq. (5). ∎

a.3 Proof of Lemma 1

Inspired by Theorem 3B.5 in [32], we have following result.


First, we show that . Let , since is continuous at , it is inner semicontinuous at . Thus for any sequence , we get the existence of a sequence of points with such that as . Then for any there exists such that

which implies

We next show the outer semicontinuity of at . For any with such that . Since is outer semicontinuous at , we have . By the continuity of and upper semicontinuity of at , we have

which implies

That is, is outer semicontinuous at according to Definition 1. ∎

a.4 Proof of Theorem 2


Because for any , we have with and . Therefore, , where is the Lipschitz continuity modulus of . Note that the Lipschitz continuity modulus is guaranteed to exist because is a convex polyhedral function. Then the first argument follows from Lemma 1 directly. The second argument actually follows from the fact that and is coercive. ∎

a.5 Proof of Proposition 1

Inspired by Proposition 4.37 in [50], we have following result.


For any , let and , and since and are both closed convex sets, and are well defined. Because , we have , and thus and .

Since , for any point , we have

Since can be any point in , we have


Next, we have

Combining with Eq. (10), we get

and thus

a.6 Proof of Theorem 3


As stated in our manuscript, according to the results proved in Proposition 1, together with the arguments given in the proof of Theorem 2, we can directly have the stability guarantees in this theorem. ∎

Appendix B Specific Iteration Schemes for Applications

In this part, the details for the bilevel optimization model and iteration schemes for particular applications including image restoration and CS-MRI are summarized.

b.1 Image Restoration

We specific the general BIO model in Eq. (2) for image restoration as follows:


where is the corrupted observation, is the desired clean image, is a blur matrix, and is the data-driven operator. As in the case where the regularizer in Eq. (4) is polyhedral convex, Eq. (4) can be reformulated in the following particular way with auxiliary variables to ease the computation. By introducing auxiliary variables , , , , we can obtain the feasible set as , where and for any given , denotes the all one vector, , , , are the corresponding dual multipliers. Based on this reformulation, the proximal ADMM updating can be summarized as:

where are respectively the penalty and regularization parameters, denotes the -th element of the given vector,

is the identity matrix, and

, .

b.2 Cs-Mri

We specific the general BIO model in Eq. (2) for CS-MRI as follows:


where is the discretized image to be rec onstructed, is the acquired -space data, is the under-sampling matrix, denotes the Fourier transform, is also a given transform (e.g., Discrete Cosine Transform (DCT) or Discrete Wavelet Transform (DWT), etc) and is the data-driven operator. Then we can generate the feasible set of the lower-level problem, i.e, , where for any given , , , , are auxiliary variables, , , , are the corresponding dual multipliers. Based on this feasible set, by introducing auxiliary variable , the proximal ADMM updating summarized as:

where are respectively the penalty and regularization parameters, denotes the sign function, is an additional auxiliary variable for the nonconvex upper-level subproblem, is the corresponding dual multiplier, , ,and .


  • [1] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process, vol. 26, no. 9, pp. 4509–4522, 2017.
  • [2] S. Lunz, C. Schoenlieb, and O. Öktem, “Adversarial regularizers in inverse problems,” in Proc. Advances in Neural Inf. Process. Systems, 2018, pp. 8516–8525.
  • [3] C. Li, W. Yin, H. Jiang, and Y. Zhang, “An efficient augmented lagrangian method with applications to total variation minimization,” Computational Optimization and Applications, vol. 56, no. 3, pp. 507–530, 2013.
  • [4] S. A. Bigdeli, M. Zwicker, P. Favaro, and M. Jin, “Deep mean-shift priors for image restoration,” in Proc. Advances in Neural Inf. Process. Systems, 2017, pp. 763–772.
  • [5] N. Yair and T. Michaeli, “Multi-scale weighted nuclear norm image restoration,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 3165–3174.
  • [6] Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image super-resolution via information distillation network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 723–731.
  • [7] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang, “Non-local recurrent network for image restoration,” in Proc. Advances in Neural Inf. Process. Systems, 2018, pp. 1680–1689.
  • [8] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu, “Residual non-local attention networks for image restoration,” in International Conference on Learning Representations, 2019.
  • [9] K. Li and J. Malik, “Learning to optimize,” in International Conference on Learning Representations, 2017.
  • [10] R. Liu, S. Cheng, L. Ma, X. Fan, Z. Luo et al., “A bridging framework for model optimization and deep propagation,” in Proc. Advances in Neural Inf. Process. Systems, 2018, pp. 4323–4332.
  • [11] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for image restoration,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2017, pp. 3929–3938.
  • [12] R. Liu, S. Cheng, L. Ma, X. Fan, and Z. Luo, “Deep proximal unrolling: Algorithmic framework, convergence analysis and applications,” IEEE Trans. Image Process, 2019.
  • [13] R. Liu, S. Cheng, Y. He, X. Fan, Z. Lin, and Z. Luo, “On the convergence of learning-based iterative methods for nonconvex inverse problems,” IEEE Trans. Pattern Anal. Mach. Intell., 2019.
  • [14] S. Dempe, V. Kalashnikov, G. A. Prez-Valds, and N. Kalashnykova, Bilevel Programming Problems: Theory, Algorithms and Applications to Energy Networks.   Springer Publishing Company, Incorporated, 2015.
  • [15] M. Solodov, “An explicit descent method for bilevel convex optimization,” Journal of Convex Analysis, vol. 14, no. 2, p. 227, 2007.
  • [16] A. Beck and S. Sabach, “A first order method for finding minimal norm-like solutions of convex optimization problems,” Mathematical Programming, vol. 147, no. 1-2, pp. 25–46, 2014.
  • [17] S. Sabach and S. Shtern, “A first order method for solving convex bilevel optimization problems,” SIAM Journal on Optimization, vol. 27, no. 2, pp. 640–660, 2017.
  • [18] E. M. Eksioglu, “Decoupled algorithm for mri reconstruction using nonlocal block matching model: Bm3d-mri,” J MATH IMAGING VIS, vol. 56, no. 3, pp. 430–440, 2016.
  • [19] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 1664–1673.
  • [20] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia, “Scale-recurrent network for deep image deblurring,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 8174–8182.
  • [21] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 7794–7803.
  • [22] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit. Workshops, 2017.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2016, pp. 770–778.
  • [24] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proc. IEEE Conf. Int. Conf. Comput. Vis., 2017, pp. 2980–2988.
  • [25] L. Fan, W. Huang, C. Gan, S. Ermon, B. Gong, and J. Huang, “End-to-end learning of motion representation for video understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 6016–6025.
  • [26] J. Zhang and B. Ghanem, “Ista-net: Interpretable optimization-inspired deep network for image compressive sensing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 1828–1837.
  • [27] J. Sun, H. Li, Z. Xu et al., “Deep admm-net for compressive sensing mri,” in Proc. Advances in Neural Inf. Process. Systems, 2016, pp. 10–18.
  • [28] S. Dempe, “Bilevel optimization: theory, algorithms and applications,” Optimization On-line URL http://www. optimization-online. org/DB_HTML/2018/08/6773. html, 2018.
  • [29] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, 2009.
  • [30] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, 2015.
  • [31] R. Liu, L. Ma, Y. Wang, and L. Zhang, “Learning converged propagations with deep prior ensemble for image enhancement,” IEEE Trans. Image Process, vol. 28, no. 3, pp. 1528–1543, 2019.
  • [32] A. L. Dontchev and R. T. Rockafellar, “Implicit functions and solution mappings,” Springer Monogr. Math., 2009.
  • [33] R. G. Baraniuk, “Compressive sensing,” IEEE Signal Processing Magazine, vol. 24, no. 4, 2007.
  • [34] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, “Understanding and evaluating blind deconvolution algorithms,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2009, pp. 1964–1971.
  • [35] J. M. Bioucas-Dias and M. A. Figueiredo, “A new twist: two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Trans. Image Process, vol. 16, no. 12, pp. 2992–3004, 2007.
  • [36] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multrans. image processle degradations,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 3262–3271.
  • [37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [38] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM journal on imaging sciences, vol. 2, no. 1, pp. 183–202, 2009.
  • [39] D. Krishnan and R. Fergus, “Fast image deconvolution using hyper-laplacian priors,” in Proc. Advances in Neural Inf. Process. Systems, 2009, pp. 1033–1041.
  • [40] A. Danielyan, V. Katkovnik, and K. Egiazarian, “Bm3d frames and variational image deblurring,” IEEE Trans. Image Process, vol. 21, no. 4, pp. 1715–1728, 2012.
  • [41] D. Zoran and Y. Weiss, “From learning models of natural image patches to whole image restoration,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2011, pp. 479–486.
  • [42] C. J. Schuler, H. C. Burger, S. Harmeling, and B. Scholkopf, “A machine learning approach for non-blind image deconvolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2013, pp. 1067–1074.
  • [43] M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly, “Compressed sensing mri,” IEEE SIGNAL PROCESSING MAGAZINE, vol. 25, no. 2, pp. 72–82, 2008.
  • [44] X. Qu, D. Guo, B. Ning, Y. Hou, Y. Lin, S. Cai, and Z. Chen, “Undersampled mri reconstruction with patch-based directional wavelets,” MAGN RESON IMAGING, vol. 30, no. 7, pp. 964–977, 2012.
  • [45] X. Qu, Y. Hou, F. Lam, D. Guo, J. Zhong, and Z. Chen, “Magnetic resonance image reconstruction from undersampled measurements using a patch-based nonlocal operator,” MED IMAGE ANAL, vol. 18, no. 6, pp. 843–856, 2014.
  • [46] Z. Zhan, J.-F. Cai, D. Guo, Y. Liu, Z. Chen, and X. Qu, “Fast multiclass dictionaries learning with geometrical directions in mri reconstruction,” IEEE Trans. Biomed. Engineering, vol. 63, no. 9, pp. 1850–1861, 2016.
  • [47] R. Liu, Y. Zhang, S. Cheng, X. Fan, and Z. Luo, “A theoretically guaranteed deep optimization framework for robust compressive sensing mri,” in

    Proceedings of Association for the Advancement of Artificial Intelligence

    , 2019.
  • [48] B. He, L.-Z. Liao, D. Han, and H. Yang, “A new inexact alternating directions method for monotone variational inequalities,” Mathematical Programming, vol. 92, no. 1, pp. 103–118, 2002.
  • [49] X. Yuan, S. Zeng, and J. Zhang, “Discerning the linear convergence of admm for structured convex optimization through the lens of variational analysis,” Optimization Online Preprint, 2018.
  • [50] J. F. Bonnans and A. Shapiro, Perturbation analysis of optimization problems.   Springer Science & Business Media, 2013.