1 Introduction
Inverse problems naturally occur in almost all areas of science and engineering [1, 2]. Such a problem can be formulated through an analytical description of the forward operator $\mathcal{A}: \mathcal{X} \to \mathcal{Y}$ between given vector spaces $\mathcal{X}$ and $\mathcal{Y}$. For example, typical tasks in computer vision that can be phrased as inverse problems include image deblurring, where $\mathcal{A}$ applies the blur kernel, super-resolution, where $\mathcal{A}$ downsamples high-resolution images, and image inpainting, where $\mathcal{A}$ is given by a mask corresponding to the inpainting domain. In medical imaging, common forward operators are the Fourier transform in MRI and the ray transform in CT. The target of an inverse problem is thus to recover an unknown $\mathbf{x} \in \mathcal{X}$ from a given noisy observation $\mathbf{y} = \mathcal{A}(\mathbf{x}) + \mathbf{n}$. In most real-world scenarios, however, this problem is ill-posed: the solution either does not exist, is not unique, or does not depend continuously on the datum $\mathbf{y}$. The scope of this work is therefore to investigate such Ill-posed Inverse Problems (IIPs for short).

In the past years, a variety of approaches have been proposed to address IIPs. The first category of methods aims to find the maximum-likelihood solution, resulting in the minimization of the data discrepancy [3, 4]. To penalize infeasible solutions, different variational regularizations have been developed to encode prior information about $\mathbf{x}$; these methods seek to minimize regularized objective functionals [5]. Such energy-minimization approaches can be applied to various types of IIPs, but they often perform relatively poorly on any individual task, mainly because it is very hard to design priors that appropriately fit the desired solutions of real-world problems. A minimal numerical illustration of this ill-posedness is sketched below.
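The following NumPy sketch (our own toy illustration, not an experiment from this paper) makes the instability concrete: a one-dimensional Gaussian blur operator is formally invertible, yet naively inverting it amplifies even tiny observation noise, so the reconstruction does not depend continuously on the datum.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D "deblurring": A is a Gaussian blur matrix, a severely ill-conditioned map.
n = 64
t = np.arange(n)
A = np.exp(-0.5 * ((t[:, None] - t[None, :]) / 2.0) ** 2)
A /= A.sum(axis=1, keepdims=True)

x_true = np.zeros(n)
x_true[20:28] = 1.0                                # piecewise-constant signal
y = A @ x_true + 1e-3 * rng.standard_normal(n)     # noisy observation

x_naive = np.linalg.solve(A, y)                    # naive inversion blows up
print(f"cond(A)         = {np.linalg.cond(A):.2e}")
print(f"naive error     = {np.linalg.norm(x_naive - x_true):.2e}")

# Truncating small singular values (a crude regularizer) already helps a lot.
x_reg = np.linalg.lstsq(A, y, rcond=1e-6)[0]
print(f"regularized err = {np.linalg.norm(x_reg - x_true):.2e}")
```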
A more recent trend is to establish regression-type neural networks that learn a direct inverse mapping $\mathcal{A}^{\dagger}: \mathcal{Y} \to \mathcal{X}$ from the observation space to the solution space [6, 7, 8]. Such methods have achieved impressive performance in several practical applications. The obvious disadvantage of these learned deep approaches, however, is that their performance depends entirely on rich, high-quality training data, and it is also challenging to learn a single network that addresses various different tasks. On the theoretical side, it is extremely hard to understand and interpret these complex network structures, owing to their heuristic nature. Very recently, preliminary works have begun to link conventional approaches with task-specific computational modules (e.g., deep learning architectures) and have developed a series of optimization-inspired learning methods for IIPs [9, 10]. The idea is to unroll an existing optimization process and replace the explicit iterative updating rule with hand-designed operators and/or learned architectures [11]. Unfortunately, due to the uncontrolled, inexact computational modules, it is hard to theoretically guarantee the convergence of these methods. The works in [10, 12, 13] introduce error-control rules to correct improper modules and thus prove the convergence of their trained iterations. However, these additional error-checking processes slow down the computation when handling challenging problems. Moreover, deeper theoretical investigations (e.g., into the stability of the iterations) are still missing.

1.1 Our Contributions
In this work, we propose a completely new paradigm, named Bilevel Integrative Optimization (BIO), to formulate and optimize IIPs in a knowledge-and-data-collaborative manner. Different from traditional variational modeling techniques, which consider only a single energy formulation, we introduce a specific bilevel system (note that our BIO formulation is not a standard bilevel program with coupled variables, but a special case with one single variable [14]) to combine principled energies, hand-designed priors, and even deeply trained network architectures in formulating IIPs. Since BIO considers general convex composite (not necessarily smooth) energies in both the upper- and lower-level subproblems, to the best of our knowledge no existing bilevel optimization algorithm can address the resulting optimization problem. Fortunately, by recharacterizing the latent feasibility as explicit set constraints, we develop an efficient iteration scheme for BIO. Theoretically, we prove that the established iteration scheme strictly converges to the global optimal solution of BIO. We also show that the computational errors arising during the practical iterations can be successfully dominated, thus verifying the stability of the proposed method. Extensive experiments on real-world computer vision applications finally demonstrate the efficiency and effectiveness of the proposed framework. In summary, our contributions mainly include:
- As a new modeling paradigm, BIO provides a flexible and modularized framework that integrates principled energies, hand-designed priors, and deeply trained network architectures to formulate IIPs.

- By investigating the stability of the nested lower-level subproblem, we rigorously establish the validity of our single-level reformulation scheme based on the latent-feasibility recharacterization.
2 Related Work
2.1 Existing Approaches for Inverse Problems
The most widely used approaches to solving IIPs minimize the misfit against the data, i.e., $\min_{\mathbf{x}} \|\mathcal{A}(\mathbf{x}) - \mathbf{y}\|^2$, whose generalized inverse is unbounded. Unfortunately, these purely analytic models are typically only an approximation of the real-world task. Moreover, given no presumption about the nature of the solution (e.g., the statistics of the image), it is virtually impossible to solve IIPs, especially in real scenarios.
Over the past decades, one of the most widely used ways to approach IIPs has been to define a likelihood and a prior and to optimize for the Maximum A Posteriori (MAP) solution. The sparsity of natural images is commonly utilized to define the priors in early works [3]. Nonlocal self-similarity was then exploited to construct priors [18]. In further research, increasingly complicated priors have been designed in pursuit of competitive performance [4, 5]. While more expressive priors allow for a better representation of the signal of interest, they typically make optimization more difficult. In fact, inference remains convex only for a few trivial prior-likelihood pairs. In practice, one often has to resort to approximations of the objective and to approximate double-loop algorithms in order to allow for scalable inference.
The huge field of machine learning provides several data-driven approaches that tackle these problems by using training datasets either to construct a problem-adapted forward operator combined with an established inversion method, or to solve the inverse problem directly [19, 20]. In particular, deep learning approaches using neural networks with multiple internal layers have become popular over recent years [21]. The development of deep learning rests on massive data [22] and significant network architectures [23]. Now that large-scale data are available almost everywhere, the design of the architecture plays the decisive role in pushing deep learning forward; indeed, several major advancements of deep learning owe to targeted architecture design [24, 8]. There is no denying the tremendous success deep learning has achieved in many fields. A vital shortcoming, however, is that the distribution of the training set directly decides the performance on test data. Moreover, beyond the stunning experimental results published for many different types of inverse problems, no consistent mathematical theory of deep neural networks for inverse problems has been developed yet.

Some recent works put more emphasis on building connections between traditional optimization and learned architectures. One direction designs complex priors with learnable parameters and unrolls the iteration scheme to obtain an end-to-end learned architecture [25, 26]. The other substitutes a subproblem in the optimization process with a data-driven network [27, 11]. In brief, the key intention of such methods lies in deriving a learned architecture from an optimization model so as to incorporate the data prior. Unfortunately, although the performance indeed improves, the convergence behaviors and stability properties become difficult to analyze, which forfeits the strengths of traditional optimization. Some works [10] have presented convergence analyses by introducing error-control rules, but these bring an extra computational burden and the theoretical results remain unsatisfying.
2.2 Bilevel Optimization Techniques
In general, bilevel models are hierarchical optimization problems in which the feasible region of the upper-level subproblem is restricted by the graph of the solution-set mapping of the lower-level subproblem; see, e.g., [28]. Due to the coupling of the hierarchical subproblems, it is extremely challenging to optimize classic bilevel problems. Recently, a series of works [15, 16, 17] has addressed specific bilevel models that minimize a convex function over the set of constrained minimizers of another convex function. Although the upper-level variables exactly coincide with the lower-level variables, the resulting convex bilevel problems are still extraordinarily ill-posed. The early work in [15] considered this category of models with smooth upper- and lower-level subproblems. Recent works in [16, 17] introduced minimal-norm-gradient and sequential-averaging methods for convex bilevel problems with a strongly convex and smooth upper-level subproblem. Unfortunately, the problem settings in [15, 16, 17] are too restrictive for our tasks, so none of these schemes is amenable as a solution method for the optimization problem considered in this work.
3 The Proposed Framework
In this section, we establish a generic convex bilevel framework that integrates different categories of information to formulate and optimize challenging inverse problems. (We demonstrate how to apply this framework to integrate both knowledge and data in real-world applications in Sec. 5.)
3.1 Bilevel Integration Modeling
We first consider a composite optimization formulation:

$$\min_{\mathbf{x}} \ f(\mathbf{x}) + g(\mathbf{x}), \qquad (1)$$

where $f$ and $g$ are extended-valued convex functions that may be nonsmooth. The model in Eq. (1) has drawn increasing attention recently owing to emerging applications in learning and vision, e.g., face recognition [29], saliency detection [30], image restoration [10], optical flow [25], and many others [31]. Different from these existing approaches, which formulate their objectives as specific instances of Eq. (1), we consider the solution set of Eq. (1) as the lower-level latent feasibility of our task and aim to solve the following hierarchical bilevel problem:

$$\min_{\mathbf{x}} \ F(\mathbf{x}) + G(\mathbf{x}) \quad \mathrm{s.t.} \quad \mathbf{x} \in \mathcal{S} := \operatorname*{arg\,min}_{\mathbf{x}} \ f(\mathbf{x}) + g(\mathbf{x}), \qquad (2)$$

where we introduce another composite function $F + G$ as our upper-level objective. Similarly, we assume that $F$ and $G$ are also extended-valued convex functions that may be nonsmooth. In this way, we obtain a convex bilevel optimization model that implicitly integrates two different hierarchies of information (i.e., $F + G$ and $f + g$) for IIP modeling.
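As a concrete toy instance of Eq. (2) (our own illustration, with $F(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|^2$, $G = g = 0$, and $f$ an underdetermined least-squares term), the upper level can be viewed as selecting one particular element from the lower-level solution set, which here is an entire affine subspace:

```python
import numpy as np

rng = np.random.default_rng(1)

# Underdetermined least squares: the lower-level solution set
#   S = argmin_x 0.5 * ||A x - y||^2
# is a whole affine subspace, so Eq. (1) alone cannot pick a point.
A = rng.standard_normal((3, 8))
y = rng.standard_normal(3)

# Upper-level objective F(x) = 0.5 * ||x||^2 selects one element of S:
# the minimum-norm solution, available in closed form via the pseudo-inverse
# (a special, analytically solvable instance of Eq. (2)).
x_bio = np.linalg.pinv(A) @ y

print(np.linalg.norm(A @ x_bio - y))        # lower-level optimal (residual ~ 0)
x_other = x_bio + np.linalg.svd(A)[2][-1]   # another point of S (null-space shift)
print(np.linalg.norm(A @ x_other - y))      # same lower-level value ...
print(np.linalg.norm(x_other) > np.linalg.norm(x_bio))  # ... but larger F
```

For this special pair the bilevel solution admits a closed form; for the general nonsmooth composite energies considered in this paper, no such formula exists, which is precisely what motivates the reformulation developed next.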
3.2 A New Single Level Reformulation Strategy
From an optimization perspective, we utilize the lower-level subproblem Eq. (1) to characterize the feasible region of BIO in Eq. (2). The main benefit of such a strategy is that we can take full advantage of our domain knowledge of the task. Indeed, Eq. (2) can be recognized as a specific convex bilevel model. However, both the upper and the lower level in general lack smoothness and strong convexity, so the existing solution schemes [15, 16, 17] no longer admit any theoretical validity. In fact, when the upper-level subproblem lacks strong convexity, directly solving Eq. (2) with such latent feasibility is extremely challenging.
Motivated by the observation that, in applications, the latent feasibility in Eq. (1) usually possesses some underlying structure, we develop in this paper a new optimization strategy with solid theoretical guarantees for structural BIO. In particular, we recharacterize the latent feasibility in Eq. (1) and thereby reformulate Eq. (2) into a standard optimization problem that is numerically tractable. To this end, we first list some structural assumptions that are necessary for the following analysis and easy to satisfy in applications of practical interest. Specifically, throughout this paper, we suppose that $f$ takes the form $f(\mathbf{x}) = h(\mathbf{A}\mathbf{x})$, where $\mathbf{A}$ is some given linear operator and $h$ has the following structural properties. (Some commonly used functions in learning and vision, e.g., linear regression $h(\mathbf{z}) = \frac{1}{2}\|\mathbf{z} - \mathbf{y}\|^2$ and likelihood estimation under Poisson noise $h(\mathbf{z}) = \sum_i \left(z_i - y_i \log z_i\right)$, automatically satisfy these assumptions, where $\mathbf{y}$ denotes the observation.)

Assumption 1.
Function $h$ is closed, proper, and convex, and admits the properties that (i) $h$ is continuously differentiable on $\mathrm{dom}(h)$, assumed to be open, and (ii) $h$ is strongly convex on any compact convex subset of $\mathrm{dom}(h)$.
We are now ready to state the following theorem to investigate the feasibility of our problem.
Theorem 1.
(Latent feasibility recharacterization) Let $\mathcal{S}$ be the solution set of Eq. (1), i.e., $\mathcal{S} := \operatorname*{arg\,min}_{\mathbf{x}} f(\mathbf{x}) + g(\mathbf{x})$. Then $\mathbf{A}\mathbf{x}$ is invariant on $\mathcal{S}$. That is, given any $\tilde{\mathbf{x}} \in \mathcal{S}$, $\mathcal{S}$ can be explicitly characterized as

$$\mathcal{S} = \left\{\mathbf{x} \ : \ \mathbf{A}\mathbf{x} = \mathbf{A}\tilde{\mathbf{x}}, \ g(\mathbf{x}) \le g(\tilde{\mathbf{x}})\right\}. \qquad (3)$$
Please notice that the proofs of all our theoretical results are presented in the Appendix.
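As a quick numerical sanity check of Theorem 1 (our own toy experiment, not from the paper), one can run a proximal-gradient solver on a lasso-type instance of Eq. (1) with duplicated columns, so that the minimizer set is potentially non-unique, and verify that $\mathbf{A}\mathbf{x}$ and $g(\mathbf{x})$ agree across the minimizers found from different initializations:

```python
import numpy as np

rng = np.random.default_rng(2)

# Lasso with two identical columns => potentially non-unique minimizers.
a = rng.standard_normal(5)
A = np.column_stack([a, a, rng.standard_normal(5)])
y = rng.standard_normal(5)
lam = 0.1

def obj(x):  # f(x) + g(x) with f = h(Ax), h = 0.5*||. - y||^2, g = lam*||x||_1
    return 0.5 * np.linalg.norm(A @ x - y) ** 2 + lam * np.abs(x).sum()

def solve(x0, steps=20000, tau=0.01):  # proximal gradient (ISTA) from x0
    x = x0.copy()
    for _ in range(steps):
        z = x - tau * A.T @ (A @ x - y)
        x = np.sign(z) * np.maximum(np.abs(z) - tau * lam, 0.0)
    return x

x1 = solve(np.zeros(3))
x2 = solve(rng.standard_normal(3) * 5)
print(obj(x1) - obj(x2))                    # ~0: both are lower-level optimal
print(np.linalg.norm(A @ x1 - A @ x2))      # ~0: A x is invariant on S
print(np.abs(x1).sum() - np.abs(x2).sum())  # ~0: g is invariant on S
```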
Following Theorem 1, we define $\Phi(\mathbf{x}) := F(\mathbf{x}) + G(\mathbf{x})$ and $\Omega := \{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{A}\tilde{\mathbf{x}}, \ g(\mathbf{x}) \le g(\tilde{\mathbf{x}})\}$. Then the original bilevel model in Eq. (2) can be equivalently reformulated as the following single-level constrained optimization problem:

$$\min_{\mathbf{x}} \ \Phi(\mathbf{x}) + \iota_{\Omega}(\mathbf{x}), \qquad (4)$$

where $\iota_{\Omega}$ denotes the indicator function of $\Omega$. The problem of solving BIO is thus reformulated as the single-level standard optimization problem in Eq. (4). To solve this reformulation, we introduce an auxiliary variable $\mathbf{u}$ with the linear constraint $\mathbf{x} = \mathbf{u}$ and consider the augmented Lagrangian function

$$\mathcal{L}_{\beta}(\mathbf{x}, \mathbf{u}; \boldsymbol{\lambda}) = F(\mathbf{x}) + G(\mathbf{u}) + \iota_{\Omega}(\mathbf{u}) + \langle \boldsymbol{\lambda}, \mathbf{x} - \mathbf{u} \rangle + \frac{\beta}{2}\|\mathbf{x} - \mathbf{u}\|^2,$$

where $\boldsymbol{\lambda}$ denotes the dual multiplier and $\beta > 0$ denotes the penalty parameter. Then the proximal ADMM (with positive semidefinite proximal matrices $\mathbf{S}_1$ and $\mathbf{S}_2$) reads as follows:

$$\begin{cases} \mathbf{x}^{k+1} = \operatorname*{arg\,min}_{\mathbf{x}} \ \mathcal{L}_{\beta}(\mathbf{x}, \mathbf{u}^k; \boldsymbol{\lambda}^k) + \frac{1}{2}\|\mathbf{x} - \mathbf{x}^k\|^2_{\mathbf{S}_1}, \\ \mathbf{u}^{k+1} = \operatorname*{arg\,min}_{\mathbf{u}} \ \mathcal{L}_{\beta}(\mathbf{x}^{k+1}, \mathbf{u}; \boldsymbol{\lambda}^k) + \frac{1}{2}\|\mathbf{u} - \mathbf{u}^k\|^2_{\mathbf{S}_2}, \\ \boldsymbol{\lambda}^{k+1} = \boldsymbol{\lambda}^k + \beta(\mathbf{x}^{k+1} - \mathbf{u}^{k+1}). \end{cases} \qquad (5)$$
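For illustration, the following NumPy sketch instantiates Eq. (5) with $\mathbf{S}_1 = \mathbf{S}_2 = \mathbf{0}$ (i.e., plain ADMM) for the simple case $F(\mathbf{x}) = \frac{1}{2}\|\mathbf{x} - \mathbf{q}\|^2$, $G = 0$, and $\Omega$ carrying only the affine part of Eq. (3); handling the sublevel constraint $g(\mathbf{u}) \le g(\tilde{\mathbf{x}})$ would add one further convex projection inside the $\mathbf{u}$-update:

```python
import numpy as np

rng = np.random.default_rng(3)

# Problem data: upper level F(x) = 0.5*||x - q||^2 (G = 0); feasible set
# Omega = {u : A u = b} with b = A x_tilde (the affine part of Eq. (3)).
m, n = 3, 8
A = rng.standard_normal((m, n))
x_tilde = rng.standard_normal(n)   # an (exact) lower-level solution
b = A @ x_tilde
q = rng.standard_normal(n)         # target encoded by the upper level

pinvA = np.linalg.pinv(A)
def proj_affine(u):                # Euclidean projection onto {u : A u = b}
    return u - pinvA @ (A @ u - b)

# ADMM on  min_x F(x) + i_Omega(u)  s.t.  x = u   (cf. Eq. (5)).
beta = 1.0
x = np.zeros(n); u = np.zeros(n); lmbd = np.zeros(n)
for k in range(500):
    x = (q + beta * u - lmbd) / (1.0 + beta)    # prox of F
    u = proj_affine(x + lmbd / beta)            # prox of i_Omega = projection
    lmbd = lmbd + beta * (x - u)                # dual ascent

print(np.linalg.norm(A @ x - b))                # feasibility (lower-level optimality)
print(np.linalg.norm(x - proj_affine(q)))       # matches the closed-form solution
```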
Thanks to Theorem 1, we directly have a corollary to guarantee the convergence of Eq. (5) toward the global optimal solutions of Eq. (2).
Corollary 1.

Let $\{(\mathbf{x}^k, \mathbf{u}^k, \boldsymbol{\lambda}^k)\}$ be the sequence generated by Eq. (5). Then $\{\mathbf{x}^k\}$ converges to a global optimal solution of the bilevel problem in Eq. (2).

Remark 1.

Indeed, we can further establish a linear convergence rate for Eq. (5) in particular settings. That is, if $F$ takes the form $F(\mathbf{x}) = \tilde{h}(\tilde{\mathbf{A}}\mathbf{x})$, where $\tilde{\mathbf{A}}$ is some given linear operator and $\tilde{h}$ satisfies Assumption 1, $G$ represents (1) a convex polyhedral regularizer, (2) the group-lasso regularizer, or (3) the sparse-group-lasso regularizer, and $g$ is a convex polyhedral function, then $\{(\mathbf{x}^k, \mathbf{u}^k, \boldsymbol{\lambda}^k)\}$ converges linearly to the KKT point set of the problem in Eq. (4). In particular, the sequence $\{\mathbf{x}^k\}$ converges linearly to the global optimal solutions of Eq. (2).
4 Theoretical Investigations
With Theorem 1, the solutions returned by solving Eq. (4) exactly optimize the bilevel problem in Eq. (2). Our single-level reformulation thus relies heavily on the solution-set recharacterization: we require one solution $\tilde{\mathbf{x}}$ of the lower-level subproblem to establish the feasible set $\Omega$.
However, obtaining such an exact solution is in general intractable. In practice, we can only calculate a solution with errors for the lower-level subproblem in Eq. (1), i.e., obtain a point $\tilde{\mathbf{x}}_{\epsilon}$ satisfying $\mathrm{dist}(\tilde{\mathbf{x}}_{\epsilon}, \mathcal{S}) \le \epsilon$, where $\mathrm{dist}$ is the distance mapping, $\epsilon$ measures the computational error, and $\mathcal{S}$ denotes the solution set of the lower-level subproblem (characterized in Eq. (3)). As a consequence, the practical optimization of Eq. (2) should be regarded as solving an approximation of Eq. (4), which can be formulated as follows (Eq. (6) is only used for our theoretical analysis, not for practical computation):

$$\min_{\mathbf{x}} \ \Phi(\mathbf{x}) + \iota_{\Omega_{\epsilon}}(\mathbf{x}), \quad \text{with } \ \Omega_{\epsilon} := \{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{A}\tilde{\mathbf{x}}_{\epsilon}, \ g(\mathbf{x}) \le g(\tilde{\mathbf{x}}_{\epsilon})\}. \qquad (6)$$
In the following, we analyze the convergence behavior and stability properties of our practical computation (abstractly formulated as Eq. (6)) from the perturbation-analysis perspective. Specifically, we consider the errors in solving for $\tilde{\mathbf{x}}$ as perturbations of the optimization in Eq. (4) and obtain the following constructive results.
4.1 Convergence Analysis
Before proving our formal convergence result, we first introduce some necessary notation. By respectively viewing $\mathbf{A}\tilde{\mathbf{x}}_{\epsilon}$ and $g(\tilde{\mathbf{x}}_{\epsilon})$ in Eq. (6) as perturbations of $\mathbf{A}\tilde{\mathbf{x}}$ and $g(\tilde{\mathbf{x}})$, we investigate the stability of the following parameterized optimization problem:

$$\min_{\mathbf{x}} \ \Phi(\mathbf{x}) \quad \mathrm{s.t.} \quad \mathbf{x} \in \Omega(\mathbf{p}), \qquad (7)$$

where $\Omega(\mathbf{p}) := \{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{b}, \ g(\mathbf{x}) \le c\}$ for any given perturbation parameter $\mathbf{p} := (\mathbf{b}, c)$; the unperturbed problem Eq. (4) corresponds to $\bar{\mathbf{p}} := (\mathbf{A}\tilde{\mathbf{x}}, g(\tilde{\mathbf{x}}))$. Moreover, we shall need the following notation:

- The feasible set mapping of Eq. (7): $\mathbf{p} \mapsto \Omega(\mathbf{p})$.

- The optimal value mapping of Eq. (7): $\nu(\mathbf{p}) := \inf \{\Phi(\mathbf{x}) : \mathbf{x} \in \Omega(\mathbf{p})\}$.

- The solution set mapping of Eq. (7): $\mathcal{X}(\mathbf{p}) := \{\mathbf{x} \in \Omega(\mathbf{p}) : \Phi(\mathbf{x}) = \nu(\mathbf{p})\}$.
Continuity properties of set-valued mappings are expressed in terms of outer and inner limits:

Definition 1.

A set-valued mapping $\Psi$ is outer semicontinuous (OSC) at $\bar{\mathbf{p}}$ when $\limsup_{\mathbf{p} \to \bar{\mathbf{p}}} \Psi(\mathbf{p}) \subseteq \Psi(\bar{\mathbf{p}})$, and inner semicontinuous (ISC) at $\bar{\mathbf{p}}$ when $\liminf_{\mathbf{p} \to \bar{\mathbf{p}}} \Psi(\mathbf{p}) \supseteq \Psi(\bar{\mathbf{p}})$. It is called continuous at $\bar{\mathbf{p}}$ when it is both OSC and ISC there, as expressed by $\lim_{\mathbf{p} \to \bar{\mathbf{p}}} \Psi(\mathbf{p}) = \Psi(\bar{\mathbf{p}})$.
Lemma 1.

Suppose that $\Phi$ is a continuous function. If $\Omega(\cdot)$ is continuous at $\bar{\mathbf{p}}$ and $\Omega(\bar{\mathbf{p}}) \neq \emptyset$, then the solution set mapping $\mathcal{X}(\cdot)$ is outer semicontinuous at $\bar{\mathbf{p}}$.
Remark 2.

We clarify the continuity assumption regarding $\Omega$ in Lemma 1. In fact, when $g$ is a convex polyhedral function, $\Omega$ is a closed polyhedral convex mapping. Then, according to Theorem 3C.3 in [32], $\Omega$ is Lipschitz continuous, i.e., there exists $\kappa \ge 0$ such that for all $\mathbf{p}_1, \mathbf{p}_2$,

$$d_H\!\left(\Omega(\mathbf{p}_1), \Omega(\mathbf{p}_2)\right) \le \kappa \|\mathbf{p}_1 - \mathbf{p}_2\|,$$

where, for any nonempty sets $C$ and $D$, the Hausdorff distance is given by $d_H(C, D) := \max\left\{\sup_{\mathbf{x} \in C} \mathrm{dist}(\mathbf{x}, D), \ \sup_{\mathbf{y} \in D} \mathrm{dist}(\mathbf{y}, C)\right\}$. Therefore, when $g$ is a convex polyhedral function, all the assumptions about $\Omega$ in the lemma above are satisfied.
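For finite point sets, the Hausdorff distance used above can be evaluated directly from its definition; a small utility (our own, for intuition only, since the sets $\Omega(\mathbf{p})$ in the analysis are of course not finite) reads:

```python
import numpy as np

def hausdorff(C, D):
    """Hausdorff distance between finite point sets C, D (rows are points),
    d_H(C, D) = max( sup_{x in C} dist(x, D), sup_{y in D} dist(y, C) )."""
    diff = np.linalg.norm(C[:, None, :] - D[None, :, :], axis=-1)
    return max(diff.min(axis=1).max(), diff.min(axis=0).max())

C = np.array([[0.0, 0.0], [1.0, 0.0]])
D = np.array([[0.0, 0.1], [1.0, 0.0], [2.0, 0.0]])
print(hausdorff(C, D))  # 1.0: the point (2, 0) is far from C
```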
Now we are ready to state the main result guaranteeing the convergence of our proposed optimization scheme.
Theorem 2.

Suppose that $g$ is a convex polyhedral function and let $\mathbf{x}_{\epsilon}$ be the solution returned by solving Eq. (6) with error $\epsilon$. If $\epsilon \to 0$, then:

- Any accumulation point $\bar{\mathbf{x}}$ of the sequence $\{\mathbf{x}_{\epsilon}\}$ satisfies $\bar{\mathbf{x}} \in \mathcal{X}(\bar{\mathbf{p}})$; that is, $\bar{\mathbf{x}}$ solves the bilevel problem in Eq. (2).

- If $\Phi$ is coercive, then the sequence $\{\mathbf{x}_{\epsilon}\}$ is bounded and hence admits at least one accumulation point.
4.2 Stability Analysis
Before establishing the desired stability result for Eq. (6), we need the following general stability analysis as a preliminary.

Proposition 1.

Suppose that there exists a neighborhood $U$ of some point $\bar{\mathbf{p}}$ such that $\Omega(\mathbf{p}) \neq \emptyset$ for all $\mathbf{p} \in U$, that $\Phi$ is Lipschitz continuous with modulus $\ell$ on a set containing $\bigcup_{\mathbf{p} \in U} \Omega(\mathbf{p})$, and that there exists $\kappa \ge 0$ such that $d_H(\Omega(\mathbf{p}_1), \Omega(\mathbf{p}_2)) \le \kappa \|\mathbf{p}_1 - \mathbf{p}_2\|$ for all $\mathbf{p}_1, \mathbf{p}_2 \in U$. Then, for any $\mathbf{p}_1, \mathbf{p}_2 \in U$, we have

$$|\nu(\mathbf{p}_1) - \nu(\mathbf{p}_2)| \le \ell \kappa \|\mathbf{p}_1 - \mathbf{p}_2\|.$$
Combining the results obtained above with the arguments given in the proof of Theorem 2, we also have the following stability guarantee.
Theorem 3.

For a given $\epsilon$, let $\mathbf{x}_{\epsilon}$ denote the solution returned by solving Eq. (6). Suppose that $g$ is a convex polyhedral function. If there exists a neighborhood $U$ of the point $\bar{\mathbf{p}}$ such that $\Omega(\mathbf{p}) \neq \emptyset$ for all $\mathbf{p} \in U$ and $\Phi$ is Lipschitz continuous on a set containing $\bigcup_{\mathbf{p} \in U} \Omega(\mathbf{p})$, then there exists $C > 0$ such that, for any sufficiently small $\epsilon$, we have $|\Phi(\mathbf{x}_{\epsilon}) - \nu(\bar{\mathbf{p}})| \le C\epsilon$.
[Fig. 1: (a) relative error and (b) reconstruction error curves for the compared strategies of obtaining the lower-level solution.]
[Table I: top part — running time (s) for computing the lower-level solution and for the upper-level iterations, with the resulting PSNR/SSIM, under the different strategies for obtaining the lower-level solution; bottom part — the same metrics under the different choices of the operator in the upper-level loss. See Sec. 6.1.]
[Fig. 2: (a) relative error and (b) reconstruction error curves for the compared choices of the operator in the upper-level loss.]
[Fig. 3: deblurring comparison against direct lower-level solvers (blurry input and ground truth shown); PSNR — APG: 31.45, FTVd: 32.36, Ours: 34.88, Ours: 35.75.]
[Fig. 4: visual comparison with PSNR/SSIM — FISTA: 34.05/0.9326, HL: 31.76/0.9195, IDDBM3D: 33.48/0.9360, EPLL: 33.52/0.9317, MLP: 33.53/0.9317, IRCNN: 34.20/0.9457, MSWNNM: 34.32/0.9459, Ours: 34.82/0.9558; blurry input and ground truth shown for reference.]
5 BIO for IIPs in Computer Vision
Indeed, a variety of computer vision tasks can be considered as IIPs. In this section we therefore demonstrate how to apply BIO to formulate and optimize typical computer vision problems, e.g., image restoration and compressive sensing MRI. (Please refer to the Appendix for more algorithmic details of these applications.)
Specifying Knowledge-driven Latent Feasibility. In particular, we first specify the lower-level subproblem in Eq. (2) by defining $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{y}\|^2$ and $g(\mathbf{x}) = \mu\|\nabla\mathbf{x}\|_1$ to reveal the task information (i.e., fidelity) and the natural image distributions (i.e., priors), respectively. Here $\|\nabla\mathbf{x}\|_1$ denotes the $\ell_1$ norm of the image gradient (a.k.a. total variation regularization [3]), which has been recognized as a proper and generic distribution assumption for natural images. As for the task-specific fidelity, $f$ is used to guarantee that $\mathbf{x}$ remains structurally similar to the observation $\mathbf{y}$ after the degradation $\mathbf{A}$. We thus define $\mathbf{A}$ as a blur matrix for image restoration [5], or as an undersampling matrix $\mathbf{P}$ composed with the Fourier transform $\mathcal{F}$ for compressive sensing MRI (i.e., $\mathbf{A} = \mathbf{P}\mathcal{F}$) [27]. In this way, we define a nested optimization task that introduces implicit feasibility to narrow down the solution space. We argue that, compared with explicitly defined constraints in a standard optimization model, which can only express straightforward conditions, the knowledge-driven latent feasibility provided by BIO is considerably more flexible and powerful. A sketch of this lower-level energy follows.
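A minimal sketch of this knowledge-driven lower-level energy (our own illustration; the weight mu and the boundary handling are arbitrary choices):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def grad_img(x):
    """Forward-difference image gradient (the operator behind ||grad x||_1)."""
    gx = np.diff(x, axis=1, append=x[:, -1:])   # horizontal differences
    gy = np.diff(x, axis=0, append=x[-1:, :])   # vertical differences
    return gx, gy

def lower_level_energy(x, blur, y, mu=0.01):
    """f(x) + g(x) with f = 0.5*||A x - y||^2 (A = blur) and g = mu * TV(x)."""
    gx, gy = grad_img(x)
    return 0.5 * np.sum((blur(x) - y) ** 2) + mu * (np.abs(gx).sum() + np.abs(gy).sum())

# Example: A as a 3x3 box blur implemented by local averaging.
blur = lambda x: uniform_filter(x, size=3, mode="reflect")

x = np.random.default_rng(0).random((32, 32))
y = blur(x)
print(lower_level_energy(x, blur, y))  # data term ~0 at the true image; TV term remains
```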
Investigating Data-driven Loss Functions. With the given latent constraints, we are ready to introduce our upper-level subproblem in Eq. (2). Indeed, we only slightly modify the standard loss function as $F(\mathbf{x}) = \frac{1}{2}\|\mathbf{x} - \mathcal{D}(\mathbf{y})\|^2$, where $\mathcal{D}$ denotes a given mapping (e.g., a pretrained deep network architecture) applied to the observed image. Thus, by learning $\mathcal{D}$ on a collected training set, we can successfully incorporate data information into the BIO formulation. (We state the details of the design and training of $\mathcal{D}$ in Sec. 6.) Furthermore, thanks to the generic structure of Eq. (2), we can also introduce additional priors via $G$ to enforce specific constraints on $\mathbf{x}$. For example, in the CS-MRI problem we introduce a nonsmooth regularization $G(\mathbf{x}) = \|\mathbf{W}\mathbf{x}\|_1$ with a transform $\mathbf{W}$ (e.g., the Discrete Cosine Transform (DCT) or the Discrete Wavelet Transform (DWT)) to further induce sparsity of images in the transformed space [33].

Table II: Averaged PSNR/SSIM of the compared methods on the benchmark of Levin et al. [34].

Method:  FTVd [3] | FISTA [38] | HL [39] | IDDBM3D [40] | EPLL [41] | MLP [42] | IRCNN [11] | MSWNNM [5] | BIO (Ours)
PSNR:    29.38 | 31.81 | 30.12 | 31.53 | 31.65 | 31.32 | 32.28 | 32.50 | 32.79
SSIM:    0.8819 | 0.9010 | 0.8961 | 0.9043 | 0.9258 | 0.8991 | 0.9200 | 0.9247 | 0.9318
[Fig. 5: visual comparison with PSNR/SSIM — FISTA: 32.71/0.9270, HL: 30.62/0.9263, IDDBM3D: 33.27/0.9526, EPLL: 31.54/0.9429, MLP: 32.80/0.9424, IRCNN: 33.83/0.9555, MSWNNM: 33.85/0.9512, Ours: 33.98/0.9582; blurry input and ground truth shown for reference.]
6 Experimental Results
In this section, we first conduct numerical experiments to study the iteration and convergence behaviors of the proposed BIO. We then compare the practical performance of BIO with state-of-the-art approaches (including deep learning methods) on two real-world IIP applications in low-level computer vision, i.e., image restoration and compressive sensing MRI. All experiments are conducted on a PC with an Intel Core i7 CPU at 3.7 GHz, 32 GB RAM, and an NVIDIA GeForce GTX 1080Ti GPU with 11 GB of memory.
6.1 Model Evaluation and Verification
We first analyze and evaluate the important components of BIO on the image restoration task. As discussed in the theoretical part above, it is necessary to specify a particular $\tilde{\mathbf{x}}$ to set up our algorithm. Here we consider different strategies to obtain it: we either set $\tilde{\mathbf{x}}$ to the corrupted observation $\mathbf{y}$, or to numerical solutions $\tilde{\mathbf{x}}_{\epsilon}$ returned by solving the lower-level subproblem in Eq. (1) with APG [35] up to different relative-error tolerances. These settings are denoted accordingly in Table I and Fig. 1. We also consider the optimal solution $\tilde{\mathbf{x}}^{*}$ of the lower-level subproblem as a reference in these experiments. (Since the lower-level subproblem in Eq. (1) is convex and APG is a strictly convergent method, we simply run a large number of iterations to numerically obtain $\tilde{\mathbf{x}}^{*}$.) It can be seen from Fig. 1 and Table I (top part) that the performance of BIO with $\tilde{\mathbf{x}}^{*}$ is the upper bound for all these settings. We also observe that the moderately accurate numerical solutions already obtain results very close to this ideal strategy, while being much faster to compute. All these results verify the nice convergence and stability properties of BIO. In the following, we therefore always adopt such a numerical solution $\tilde{\mathbf{x}}_{\epsilon}$ to set up BIO for the applications.
[Fig. 6: visual comparison with PSNR/SSIM — FISTA: 30.31/0.8884, HL: 27.51/0.8487, IDDBM3D: 29.48/0.8885, EPLL: 28.72/0.8886, MLP: 29.89/0.9003, IRCNN: 30.25/0.9078, MSWNNM: 30.36/0.9150, Ours: 31.24/0.9206; blurry input and ground truth shown for reference.]
[Fig. 7: large-kernel color-image deblurring; PSNR/SSIM — HL: 21.18/0.4917, EPLL: 19.08/0.5708, IDDBM3D: 25.27/0.7809, MSWNNM: 25.81/0.8047, BIO (Ours): 26.54/0.8425.]
[Fig. 8: PSNR/SSIM scatter plots of all compared methods under the Cartesian, Gaussian, and Radial masks.]

[Fig. 9: CS-MRI reconstructions under three masks at a 30% sampling rate; PSNR/SSIM for PANO, FDLCP, ADMMNet, BM3DMRI, TGDOF, and BIO (Ours) — row 1: 34.73/0.9267, 36.82/0.9511, 31.97/0.9009, 32.45/0.9168, 37.80/0.9583, 38.35/0.9635; row 2: 43.66/0.9792, 44.12/0.9761, 42.39/0.9776, 46.31/0.9911, 45.83/0.9832, 46.88/0.9917; row 3: 33.80/0.8891, 34.85/0.9408, 33.79/0.9320, 35.02/0.9556, 35.92/0.9504, 36.04/0.9641; zero-filling shown for reference.]

[Fig. 10: chest-data reconstruction; PSNR/SSIM for PANO, FDLCP, ADMMNet, BM3DMRI, TGDOF, and BIO (Ours) — 27.72/0.8155, 26.84/0.7929, 25.38/0.7488, 26.26/0.7740, 28.22/0.8260, 28.78/0.8440; zero-filling shown for reference.]
It is known that another important component of BIO is the operator $\mathcal{D}$ in the upper-level subproblem. Here we also consider different strategies for designing $\mathcal{D}$. The most naive choice is the identity mapping (i.e., $\mathcal{D}(\mathbf{y}) = \mathbf{y}$, denoted as $\mathcal{D}_{I}$). We can also define this operator by introducing a deconvolution loss and solving it in closed form (denoted as $\mathcal{D}_{deconv}$), i.e., $\mathcal{D}(\mathbf{y}) = \operatorname*{arg\,min}_{\mathbf{x}} \frac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{y}\|^2 + \frac{\gamma}{2}\|\mathbf{x}\|^2$ with a trade-off parameter $\gamma$ (fixed for all the experiments). Inspired by recent developments in deep learning, we further introduce a convolutional neural network (CNN) [36] as $\mathcal{D}$ (denoted as $\mathcal{D}_{CNN}$) in BIO. That is, we design a simple denoising CNN architecture consisting of seven dilated convolutions, six ReLU activations (plugged between each pair of convolution layers), and five batch normalizations (plugged between convolution and ReLU, except for the first convolution layer). The following sketch illustrates one such architecture.
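A PyTorch sketch matching this layer count is given below; the channel width, the dilation rates, and the residual (noise-predicting) output are our own assumptions, since the text only fixes the numbers of convolutions, ReLUs, and batch normalizations.

```python
import torch
import torch.nn as nn

class DenoiserCNN(nn.Module):
    """Seven dilated convolutions, six ReLUs (between each pair of convolutions),
    five batch norms (between conv and ReLU, except after the first conv)."""
    def __init__(self, channels=1, width=64, dilations=(1, 2, 3, 4, 3, 2, 1)):
        super().__init__()
        layers = [nn.Conv2d(channels, width, 3, padding=dilations[0], dilation=dilations[0]),
                  nn.ReLU(inplace=True)]            # first conv: no batch norm
        for d in dilations[1:-1]:                   # five middle convs: conv + BN + ReLU
            layers += [nn.Conv2d(width, width, 3, padding=d, dilation=d),
                       nn.BatchNorm2d(width),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(width, channels, 3, padding=dilations[-1],
                                dilation=dilations[-1]))  # last conv: plain
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Residual learning (predict the noise) is a common choice for
        # denoising CNNs; it is our assumption here, not stated in the text.
        return x - self.net(x)

noisy = torch.randn(1, 1, 64, 64)
print(DenoiserCNN()(noisy).shape)  # torch.Size([1, 1, 64, 64])
```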
For training, we collected 800 natural images from the ImageNet database [37] and added Gaussian noise to generate the training data for our CNN-based denoiser in all the experiments. We observe in Fig. 2 and Table I that the task-inspired operator $\mathcal{D}_{deconv}$ is better than the naive strategy $\mathcal{D}_{I}$, while our learning-based module $\mathcal{D}_{CNN}$ significantly improves both the numerical performance and the final quantitative and qualitative results, which successfully verifies the value of the knowledge-and-data integration paradigm in BIO.

From the optimization perspective, BIO can also be understood as providing a bilevel way to improve the optimization process of the lower-level model. To verify this effect, we utilize two well-known numerical optimization methods, i.e., FTVd [3] and APG [35], to directly solve the lower-level model in Eq. (1). We compare their results with two versions of BIO, namely Ours (with $\mathcal{D}_{deconv}$) and Ours (with $\mathcal{D}_{CNN}$), in Fig. 3. It can easily be seen that both BIO solvers obtain remarkably better performance than the traditional numerical optimization approaches. The learning-based strategy is the best among all compared methods, which successfully verifies the power of our data-driven bilevel integration mechanism.
6.2 Applications in Computer Vision
Image Restoration. We now compare BIO with state-of-the-art image restoration approaches, including FTVd [3], FISTA [38], HL [39], IDDBM3D [40], EPLL [41], MLP [42], IRCNN [11], and MSWNNM [5]. We first conduct experiments on the well-known benchmark of Levin et al. [34], which includes 32 images blurred by 8 different kernels of varying sizes. The visual comparisons on three example images from this benchmark are plotted in Figs. 4-6; our proposed algorithm clearly achieves prominent detail recovery with the highest PSNR/SSIM scores in the different cases. We also report the averaged quantitative scores (i.e., PSNR and SSIM) in Table II. The deep-learning-based IRCNN achieves much better performance than traditional optimization-based methods, and the recently developed MSWNNM adopts a multi-scale technique to recover rich image details, so its results are better still. Thanks to the novel integration of knowledge-driven and data-driven mechanisms, however, our BIO obtains the best quantitative and qualitative results among all tested methods. In addition, a color image corrupted by a very large kernel is utilized to further evaluate these approaches in Fig. 7. All the compared state-of-the-art approaches tend to produce unclear details, whereas BIO recovers richer textures and more image details and thus performs best.
Compressive Sensing MRI (CS-MRI). We then evaluate BIO on the CS-MRI problem. Specifically, we conduct all experiments on 55 images from the widely used IXI MRI benchmark (http://brain-development.org/ixi-dataset/). Our experiments involve three types of undersampling patterns (i.e., Cartesian, Gaussian, and Radial masks) and two sampling rates to generate the sparse k-space data. We compare BIO with many existing state-of-the-art methods, including TV [43], SIDWT [33], PBDW [44], PANO [45], FDLCP [46], ADMMNet [27], BM3DMRI [18], and TGDOF [47]. The scatter plots in Fig. 8 illustrate the quantitative results of all compared approaches on the test dataset, covering the three undersampling patterns and both sampling rates. Our BIO achieves the best results in terms of both PSNR and SSIM scores (i.e., the upper-right red points) in all situations. We then consider the three mask types with a 30% sampling rate to fully verify the performance; these results are shown in Fig. 9. Clearly, BIO achieves outstanding performance, with richer details and the highest quantitative scores in all these scenarios. Further, we test on challenging chest data using a Cartesian mask at a 30% sampling rate in Fig. 10. Consistently, BIO achieves the best qualitative and quantitative performance. These experiments fully indicate the superiority of BIO on the CS-MRI problem.
7 Conclusions
This paper studied a bilevel optimization paradigm that integrates knowledge and data for formulating and optimizing ill-posed inverse problems. We theoretically proved that the established solution scheme strictly converges to the global optimal solution of BIO. Extensive experiments on two real-world computer vision applications fully revealed the efficiency and effectiveness of the proposed framework.
Appendix A Proofs of Our Theoretical Results
In the following, we first provide detailed proofs for the propositions and theorems in the manuscript. The details of the bilevel optimization models and iteration schemes for particular applications are also summarized.
A.1 Proof of Theorem 1

Proof.
Let $\tilde{\mathbf{x}} \in \mathcal{S}$ be given. First, for any $\mathbf{x} \in \mathcal{S}$, we have $h(\mathbf{A}\mathbf{x}) + g(\mathbf{x}) = h(\mathbf{A}\tilde{\mathbf{x}}) + g(\tilde{\mathbf{x}})$, thus

$$h(\mathbf{A}\mathbf{x}) - h(\mathbf{A}\tilde{\mathbf{x}}) = g(\tilde{\mathbf{x}}) - g(\mathbf{x}). \qquad (8)$$

For any $\mathbf{x} \in \mathcal{S}$, if $\mathbf{A}\mathbf{x} \neq \mathbf{A}\tilde{\mathbf{x}}$, then by the strong convexity of $h$ on a compact convex set containing $\mathbf{A}\mathbf{x}$ and $\mathbf{A}\tilde{\mathbf{x}}$, there exists $\sigma > 0$ such that

$$h\!\left(\tfrac{\mathbf{A}\mathbf{x} + \mathbf{A}\tilde{\mathbf{x}}}{2}\right) \le \tfrac{1}{2} h(\mathbf{A}\mathbf{x}) + \tfrac{1}{2} h(\mathbf{A}\tilde{\mathbf{x}}) - \tfrac{\sigma}{8}\|\mathbf{A}\mathbf{x} - \mathbf{A}\tilde{\mathbf{x}}\|^2.$$

And since $g$ is convex, we have

$$g\!\left(\tfrac{\mathbf{x} + \tilde{\mathbf{x}}}{2}\right) \le \tfrac{1}{2} g(\mathbf{x}) + \tfrac{1}{2} g(\tilde{\mathbf{x}}).$$

Combining the two inequalities given above, we obtain

$$f\!\left(\tfrac{\mathbf{x} + \tilde{\mathbf{x}}}{2}\right) + g\!\left(\tfrac{\mathbf{x} + \tilde{\mathbf{x}}}{2}\right) < f(\tilde{\mathbf{x}}) + g(\tilde{\mathbf{x}}),$$

which contradicts the fact that $\tilde{\mathbf{x}}$ minimizes $f + g$; hence $\mathbf{A}\mathbf{x} = \mathbf{A}\tilde{\mathbf{x}}$ for all $\mathbf{x} \in \mathcal{S}$. Next, since $\mathbf{A}\mathbf{x} = \mathbf{A}\tilde{\mathbf{x}}$ and thus $h(\mathbf{A}\mathbf{x}) = h(\mathbf{A}\tilde{\mathbf{x}})$, Eq. (8) yields

$$g(\mathbf{x}) = g(\tilde{\mathbf{x}}). \qquad (9)$$

Conversely, any $\mathbf{x}$ with $\mathbf{A}\mathbf{x} = \mathbf{A}\tilde{\mathbf{x}}$ and $g(\mathbf{x}) \le g(\tilde{\mathbf{x}})$ satisfies $f(\mathbf{x}) + g(\mathbf{x}) \le f(\tilde{\mathbf{x}}) + g(\tilde{\mathbf{x}})$ and hence belongs to $\mathcal{S}$. Upon combining Eq. (8) and Eq. (9), we reach the recharacterization of $\mathcal{S}$ as Eq. (3). ∎
A.2 Proof of Corollary 1

Remark 3.
(Restatement of Remark 1.) If $F$ takes the form $F(\mathbf{x}) = \tilde{h}(\tilde{\mathbf{A}}\mathbf{x})$, where $\tilde{\mathbf{A}}$ is some given linear operator and $\tilde{h}$ satisfies Assumption 1, $G$ is (1) a convex polyhedral regularizer, (2) the group-lasso regularizer, or (3) the sparse-group-lasso regularizer, and $g$ is a convex polyhedral function, then $\{(\mathbf{x}^k, \mathbf{u}^k, \boldsymbol{\lambda}^k)\}$ converges linearly to the KKT point set of the problem in Eq. (4), and the sequence $\{\mathbf{x}^k\}$ converges linearly to the global optimal solutions of Eq. (2).

Proof.
By Theorem 1, we have the equivalence between Eq. (2) and Eq. (4), so the convergence of the iterations in Eq. (5) is directly guaranteed by the standard convergence results for proximal ADMM [48]. Moreover, by applying the investigations in [49], we can further obtain the linear convergence-rate estimation for our iteration in Eq. (5). ∎
A.3 Proof of Lemma 1

Inspired by Theorem 3B.5 in [32], we have the following result.

Proof.
First, we show that $\nu$ is upper semicontinuous at $\bar{\mathbf{p}}$. Let $\bar{\mathbf{x}} \in \mathcal{X}(\bar{\mathbf{p}})$. Since $\Omega$ is continuous at $\bar{\mathbf{p}}$, it is inner semicontinuous at $\bar{\mathbf{p}}$. Thus, for any sequence $\mathbf{p}_k \to \bar{\mathbf{p}}$, we get the existence of a sequence of points $\mathbf{x}_k \in \Omega(\mathbf{p}_k)$ such that $\mathbf{x}_k \to \bar{\mathbf{x}}$ as $k \to \infty$. Then, for any $\varepsilon > 0$, there exists $K$ such that

$$\nu(\mathbf{p}_k) \le \Phi(\mathbf{x}_k) \le \Phi(\bar{\mathbf{x}}) + \varepsilon = \nu(\bar{\mathbf{p}}) + \varepsilon \quad \text{for all } k \ge K,$$

which implies

$$\limsup_{\mathbf{p} \to \bar{\mathbf{p}}} \nu(\mathbf{p}) \le \nu(\bar{\mathbf{p}}).$$

We next show the outer semicontinuity of $\mathcal{X}$ at $\bar{\mathbf{p}}$. Take any sequences $\mathbf{p}_k \to \bar{\mathbf{p}}$ and $\mathbf{x}_k \in \mathcal{X}(\mathbf{p}_k)$ such that $\mathbf{x}_k \to \bar{\mathbf{x}}$. Since $\Omega$ is outer semicontinuous at $\bar{\mathbf{p}}$, we have $\bar{\mathbf{x}} \in \Omega(\bar{\mathbf{p}})$. By the continuity of $\Phi$ and the upper semicontinuity of $\nu$ at $\bar{\mathbf{p}}$, we have

$$\Phi(\bar{\mathbf{x}}) = \lim_{k \to \infty} \Phi(\mathbf{x}_k) = \lim_{k \to \infty} \nu(\mathbf{p}_k) \le \nu(\bar{\mathbf{p}}),$$

which implies $\bar{\mathbf{x}} \in \mathcal{X}(\bar{\mathbf{p}})$. That is, $\mathcal{X}$ is outer semicontinuous at $\bar{\mathbf{p}}$ according to Definition 1. ∎
A.4 Proof of Theorem 2

Proof.
For any $\epsilon > 0$, we have $\tilde{\mathbf{x}}_{\epsilon}$ with $\mathrm{dist}(\tilde{\mathbf{x}}_{\epsilon}, \mathcal{S}) \le \epsilon$, and hence $\mathbf{p}_{\epsilon} := (\mathbf{A}\tilde{\mathbf{x}}_{\epsilon}, g(\tilde{\mathbf{x}}_{\epsilon}))$ satisfies $\|\mathbf{p}_{\epsilon} - \bar{\mathbf{p}}\| \le L\epsilon$, where $L$ is the Lipschitz-continuity modulus of the mapping $\mathbf{x} \mapsto (\mathbf{A}\mathbf{x}, g(\mathbf{x}))$. Note that this Lipschitz-continuity modulus is guaranteed to exist because $g$ is a convex polyhedral function. The first argument then follows from Lemma 1 directly. The second argument follows from the fact that $\mathbf{x}_{\epsilon} \in \Omega(\mathbf{p}_{\epsilon})$ and $\Phi$ is coercive. ∎
A.5 Proof of Proposition 1

Inspired by Proposition 4.37 in [50], we have the following result.

Proof.
For any $\mathbf{p}_1, \mathbf{p}_2 \in U$, let $\mathbf{x}_1 \in \mathcal{X}(\mathbf{p}_1)$ and $\mathbf{x}_2 := \mathrm{Proj}_{\Omega(\mathbf{p}_2)}(\mathbf{x}_1)$; since $\Omega(\mathbf{p}_1)$ and $\Omega(\mathbf{p}_2)$ are both closed convex sets, $\mathbf{x}_1$ and $\mathbf{x}_2$ are well defined. Because $d_H(\Omega(\mathbf{p}_1), \Omega(\mathbf{p}_2)) \le \kappa\|\mathbf{p}_1 - \mathbf{p}_2\|$, we have $\|\mathbf{x}_1 - \mathbf{x}_2\| \le \kappa\|\mathbf{p}_1 - \mathbf{p}_2\|$, and thus

$$\nu(\mathbf{p}_2) \le \Phi(\mathbf{x}_2) \le \Phi(\mathbf{x}_1) + \ell\|\mathbf{x}_1 - \mathbf{x}_2\| \le \nu(\mathbf{p}_1) + \ell\kappa\|\mathbf{p}_1 - \mathbf{p}_2\|. \qquad (10)$$

Next, exchanging the roles of $\mathbf{p}_1$ and $\mathbf{p}_2$, we also have

$$\nu(\mathbf{p}_1) \le \nu(\mathbf{p}_2) + \ell\kappa\|\mathbf{p}_1 - \mathbf{p}_2\|.$$

Combining with Eq. (10), we get

$$|\nu(\mathbf{p}_1) - \nu(\mathbf{p}_2)| \le \ell\kappa\|\mathbf{p}_1 - \mathbf{p}_2\|,$$

and thus the claimed Lipschitz estimate holds. ∎
A.6 Proof of Theorem 3

Proof.
By Remark 2, when $g$ is a convex polyhedral function, $\Omega$ is Lipschitz continuous on $U$. Combining Proposition 1 with the estimate $\|\mathbf{p}_{\epsilon} - \bar{\mathbf{p}}\| \le L\epsilon$ from the proof of Theorem 2 directly yields the claimed bound. ∎
Appendix B Specific Iteration Schemes for Applications
In this part, we summarize the details of the bilevel optimization models and iteration schemes for the particular applications, namely image restoration and CS-MRI.
B.1 Image Restoration
We specify the general BIO model in Eq. (2) for image restoration as follows:

$$\min_{\mathbf{x}} \ \frac{1}{2}\|\mathbf{x} - \mathcal{D}(\mathbf{y})\|^{2} \quad \mathrm{s.t.} \quad \mathbf{x} \in \operatorname*{arg\,min}_{\mathbf{x}} \ \frac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{y}\|^{2} + \mu\|\nabla\mathbf{x}\|_{1}, \qquad (11)$$

where $\mathbf{y}$ is the corrupted observation, $\mathbf{x}$ is the desired clean image, $\mathbf{A}$ is a blur matrix, and $\mathcal{D}$ is the data-driven operator. Since the regularizer here is polyhedral convex, Eq. (4) can be reformulated in a particular way with auxiliary variables to ease the computation: splitting the blur term, the two gradient components, and the $\ell_{1}$-ball constraint $\|\nabla\mathbf{x}\|_{1} \le \|\nabla\tilde{\mathbf{x}}\|_{1}$ (for a given lower-level solution $\tilde{\mathbf{x}}$, with the all-one vector appearing in the linear reformulation of the $\ell_{1}$ ball) yields an explicit feasible set with four auxiliary variables and their corresponding dual multipliers. Based on this reformulation, the proximal ADMM updates alternate among (i) a quadratic $\mathbf{x}$-step, i.e., a linear system involving $\mathbf{A}^{\top}\mathbf{A}$, $\nabla^{\top}\nabla$, and the identity matrix, (ii) element-wise soft-thresholding and projection steps for the auxiliary variables, and (iii) ascent steps for the multipliers, where $\beta$ and $\mu$ are respectively the penalty and regularization parameters.
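The two nonsmooth steps rely on classical closed-form operations; a sketch with our own function names (the exact constants in the updates above may differ) is:

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau*||.||_1: element-wise shrinkage used for the l1-type updates."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def project_l1_ball(v, c):
    """Euclidean projection onto {u : ||u||_1 <= c}, used for the sublevel-set
    constraint g(u) <= g(x_tilde) when g is an l1-type norm (Duchi et al.-style)."""
    if np.abs(v).sum() <= c:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - c)[0][-1]
    theta = (css[rho] - c) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

v = np.array([3.0, -1.0, 0.5])
print(soft_threshold(v, 1.0))                 # [ 2. -0.  0.]
print(np.abs(project_l1_ball(v, 2.0)).sum())  # 2.0
```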
B.2 CS-MRI
We specify the general BIO model in Eq. (2) for CS-MRI as follows:

$$\min_{\mathbf{x}} \ \frac{1}{2}\|\mathbf{x} - \mathcal{D}(\mathbf{y})\|^{2} + \|\mathbf{W}\mathbf{x}\|_{1} \quad \mathrm{s.t.} \quad \mathbf{x} \in \operatorname*{arg\,min}_{\mathbf{x}} \ \frac{1}{2}\|\mathbf{P}\mathcal{F}\mathbf{x} - \mathbf{y}\|^{2} + \mu\|\nabla\mathbf{x}\|_{1}, \qquad (12)$$

where $\mathbf{x}$ is the discretized image to be reconstructed, $\mathbf{y}$ is the acquired k-space data, $\mathbf{P}$ is the undersampling matrix, $\mathcal{F}$ denotes the Fourier transform, $\mathbf{W}$ is a given transform (e.g., the Discrete Cosine Transform (DCT) or the Discrete Wavelet Transform (DWT)), and $\mathcal{D}$ is the data-driven operator. The feasible set of the lower-level problem is then $\Omega = \{\mathbf{x} : \mathbf{P}\mathcal{F}\mathbf{x} = \mathbf{P}\mathcal{F}\tilde{\mathbf{x}}, \ \|\nabla\mathbf{x}\|_{1} \le \|\nabla\tilde{\mathbf{x}}\|_{1}\}$ for a given lower-level solution $\tilde{\mathbf{x}}$, with auxiliary variables introduced for the gradient components and the constraint set, together with their corresponding dual multipliers. Based on this feasible set, the proximal ADMM updates alternate among (i) a quadratic $\mathbf{x}$-step that is diagonal in k-space, (ii) soft-thresholding steps (involving the sign function) for the $\ell_{1}$-type auxiliary variables, including an additional auxiliary variable handling the nonsmooth upper-level term $\|\mathbf{W}\mathbf{x}\|_{1}$ and its corresponding dual multiplier, and (iii) ascent steps for the dual multipliers, where $\beta$ and $\mu$ are respectively the penalty and regularization parameters.
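For intuition about step (i), any least-squares-plus-proximity subproblem in $\mathbf{x}$ is diagonal in k-space because $\mathcal{F}$ is unitary. A minimal sketch (our own variable names, for the representative subproblem $\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{P}\mathcal{F}\mathbf{x} - \mathbf{y}\|^{2} + \frac{\beta}{2}\|\mathbf{x} - \mathbf{v}\|^{2}$, which shares the diagonal structure of the actual $\mathbf{x}$-update) reads:

```python
import numpy as np

def csmri_x_update(y, mask, v, beta):
    """Closed-form minimizer of 0.5*||P F x - y||^2 + (beta/2)*||x - v||^2.
    Since F is unitary (norm="ortho"), the normal equations are diagonal in
    k-space: (M + beta) X = M y + beta F v, with M the 0/1 sampling mask."""
    rhs = mask * y + beta * np.fft.fft2(v, norm="ortho")
    return np.real(np.fft.ifft2(rhs / (mask + beta), norm="ortho"))

mask = (np.random.default_rng(0).random((8, 8)) < 0.3).astype(float)  # 30% sampling
x_true = np.random.default_rng(1).random((8, 8))
y = mask * np.fft.fft2(x_true, norm="ortho")   # undersampled k-space data
x = csmri_x_update(y, mask, v=np.zeros((8, 8)), beta=1e-3)
print(x.shape)  # (8, 8); as beta -> 0 with v = 0, this tends to zero-filling
```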
References
 [1] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process, vol. 26, no. 9, pp. 4509–4522, 2017.
 [2] S. Lunz, C. Schoenlieb, and O. Öktem, “Adversarial regularizers in inverse problems,” in Proc. Advances in Neural Inf. Process. Systems, 2018, pp. 8516–8525.
 [3] C. Li, W. Yin, H. Jiang, and Y. Zhang, “An efficient augmented lagrangian method with applications to total variation minimization,” Computational Optimization and Applications, vol. 56, no. 3, pp. 507–530, 2013.
 [4] S. A. Bigdeli, M. Zwicker, P. Favaro, and M. Jin, “Deep meanshift priors for image restoration,” in Proc. Advances in Neural Inf. Process. Systems, 2017, pp. 763–772.
 [5] N. Yair and T. Michaeli, “Multiscale weighted nuclear norm image restoration,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 3165–3174.
 [6] Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image superresolution via information distillation network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 723–731.
 [7] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang, “Nonlocal recurrent network for image restoration,” in Proc. Advances in Neural Inf. Process. Systems, 2018, pp. 1680–1689.
 [8] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu, “Residual nonlocal attention networks for image restoration,” in International Conference on Learning Representations, 2019.
 [9] K. Li and J. Malik, “Learning to optimize,” in International Conference on Learning Representations, 2017.
 [10] R. Liu, S. Cheng, L. Ma, X. Fan, Z. Luo et al., “A bridging framework for model optimization and deep propagation,” in Proc. Advances in Neural Inf. Process. Systems, 2018, pp. 4323–4332.
 [11] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for image restoration,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2017, pp. 3929–3938.
 [12] R. Liu, S. Cheng, L. Ma, X. Fan, and Z. Luo, “Deep proximal unrolling: Algorithmic framework, convergence analysis and applications,” IEEE Trans. Image Process, 2019.
 [13] R. Liu, S. Cheng, Y. He, X. Fan, Z. Lin, and Z. Luo, “On the convergence of learningbased iterative methods for nonconvex inverse problems,” IEEE Trans. Pattern Anal. Mach. Intell., 2019.
 [14] S. Dempe, V. Kalashnikov, G. A. Pérez-Valdés, and N. Kalashnykova, Bilevel Programming Problems: Theory, Algorithms and Applications to Energy Networks. Springer Publishing Company, Incorporated, 2015.
 [15] M. Solodov, “An explicit descent method for bilevel convex optimization,” Journal of Convex Analysis, vol. 14, no. 2, p. 227, 2007.
 [16] A. Beck and S. Sabach, “A first order method for finding minimal norm-like solutions of convex optimization problems,” Mathematical Programming, vol. 147, no. 1–2, pp. 25–46, 2014.
 [17] S. Sabach and S. Shtern, “A first order method for solving convex bilevel optimization problems,” SIAM Journal on Optimization, vol. 27, no. 2, pp. 640–660, 2017.
 [18] E. M. Eksioglu, “Decoupled algorithm for MRI reconstruction using nonlocal block matching model: BM3D-MRI,” Journal of Mathematical Imaging and Vision, vol. 56, no. 3, pp. 430–440, 2016.
 [19] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep backprojection networks for superresolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 1664–1673.
 [20] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia, “Scalerecurrent network for deep image deblurring,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 8174–8182.
 [21] X. Wang, R. Girshick, A. Gupta, and K. He, “Nonlocal neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 7794–7803.
 [22] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image superresolution: Dataset and study,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit. Workshops, 2017.
 [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2016, pp. 770–778.
 [24] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2980–2988.
 [25] L. Fan, W. Huang, C. Gan, S. Ermon, B. Gong, and J. Huang, “End-to-end learning of motion representation for video understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 6016–6025.
 [26] J. Zhang and B. Ghanem, “ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 1828–1837.
 [27] J. Sun, H. Li, Z. Xu et al., “Deep ADMM-Net for compressive sensing MRI,” in Proc. Advances in Neural Inf. Process. Systems, 2016, pp. 10–18.
 [28] S. Dempe, “Bilevel optimization: theory, algorithms and applications,” Optimization Online, http://www.optimization-online.org/DB_HTML/2018/08/6773.html, 2018.
 [29] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, 2009.
 [30] M.M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.M. Hu, “Global contrast based salient region detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, 2015.
 [31] R. Liu, L. Ma, Y. Wang, and L. Zhang, “Learning converged propagations with deep prior ensemble for image enhancement,” IEEE Trans. Image Process, vol. 28, no. 3, pp. 1528–1543, 2019.
 [32] A. L. Dontchev and R. T. Rockafellar, “Implicit functions and solution mappings,” Springer Monogr. Math., 2009.
 [33] R. G. Baraniuk, “Compressive sensing,” IEEE Signal Processing Magazine, vol. 24, no. 4, 2007.
 [34] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, “Understanding and evaluating blind deconvolution algorithms,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2009, pp. 1964–1971.
 [35] J. M. BioucasDias and M. A. Figueiredo, “A new twist: twostep iterative shrinkage/thresholding algorithms for image restoration,” IEEE Trans. Image Process, vol. 16, no. 12, pp. 2992–3004, 2007.
 [36] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2018, pp. 3262–3271.
 [37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei, “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 [38] A. Beck and M. Teboulle, “A fast iterative shrinkagethresholding algorithm for linear inverse problems,” SIAM journal on imaging sciences, vol. 2, no. 1, pp. 183–202, 2009.
 [39] D. Krishnan and R. Fergus, “Fast image deconvolution using hyperlaplacian priors,” in Proc. Advances in Neural Inf. Process. Systems, 2009, pp. 1033–1041.
 [40] A. Danielyan, V. Katkovnik, and K. Egiazarian, “Bm3d frames and variational image deblurring,” IEEE Trans. Image Process, vol. 21, no. 4, pp. 1715–1728, 2012.
 [41] D. Zoran and Y. Weiss, “From learning models of natural image patches to whole image restoration,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2011, pp. 479–486.
 [42] C. J. Schuler, H. C. Burger, S. Harmeling, and B. Scholkopf, “A machine learning approach for nonblind image deconvolution,” in Proc. IEEE Conf. Comput. Vis. Pattern Recongnit., 2013, pp. 1067–1074.
 [43] M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly, “Compressed sensing MRI,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 72–82, 2008.
 [44] X. Qu, D. Guo, B. Ning, Y. Hou, Y. Lin, S. Cai, and Z. Chen, “Undersampled MRI reconstruction with patch-based directional wavelets,” Magnetic Resonance Imaging, vol. 30, no. 7, pp. 964–977, 2012.
 [45] X. Qu, Y. Hou, F. Lam, D. Guo, J. Zhong, and Z. Chen, “Magnetic resonance image reconstruction from undersampled measurements using a patch-based nonlocal operator,” Medical Image Analysis, vol. 18, no. 6, pp. 843–856, 2014.
 [46] Z. Zhan, J.F. Cai, D. Guo, Y. Liu, Z. Chen, and X. Qu, “Fast multiclass dictionaries learning with geometrical directions in mri reconstruction,” IEEE Trans. Biomed. Engineering, vol. 63, no. 9, pp. 1850–1861, 2016.
 [47] R. Liu, Y. Zhang, S. Cheng, X. Fan, and Z. Luo, “A theoretically guaranteed deep optimization framework for robust compressive sensing MRI,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
 [48] B. He, L.-Z. Liao, D. Han, and H. Yang, “A new inexact alternating directions method for monotone variational inequalities,” Mathematical Programming, vol. 92, no. 1, pp. 103–118, 2002.
 [49] X. Yuan, S. Zeng, and J. Zhang, “Discerning the linear convergence of admm for structured convex optimization through the lens of variational analysis,” Optimization Online Preprint, 2018.
 [50] J. F. Bonnans and A. Shapiro, Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.