Introduction
In recent years, deep models (a.k.a. deep neural networks) have achieved state-of-the-art performance in many application fields, such as image processing, object recognition, natural language processing, and bioinformatics. On the downside, these existing approaches are typically designed based on a heuristic understanding of a particular problem and trained using engineering experience to implement multilayered feature propagations, e.g., [Krizhevsky, Sutskever, and Hinton2012, Simonyan and Zisserman2014, He et al.2016]. Therefore, they lack solid theoretical guidance and interpretation. More importantly, it is challenging to incorporate the mathematical rules and principles of the considered task into these existing networks.
Alternatively, several recent works, e.g., [Gregor and LeCun2010, Schmidt and Roth2014, Andrychowicz et al.2016], build their networks using a specific optimization model and iteration scheme. The main idea is to unroll numerical algorithms and construct network architectures from the resulting iterations. In this way, these approaches successfully incorporate the information of a predefined energy into the network propagation. Nevertheless, due to the dynamic nature of parameterized iterations, existing theoretical results from the optimization area, especially convergence guarantees, no longer hold. Furthermore, the unrolled deep models often have limited flexibility and adaptability because their basic architectures are restricted by the particular iteration scheme.
To partially overcome the limitations of existing approaches, this work attempts to develop a simple, flexible and efficient framework to build deep models for various real-world applications. The starting point of our work is Energy-Based Models (EBMs) [Teh et al.2003, Zhao, Mathieu, and LeCun2016]. EBMs are a family of methods that associate a scalar energy with each configuration of the observations and the predictions of interest. Inference in an EBM consists of searching for a configuration of variables that minimizes the energy. In this work, we consider the following energy minimization formulation
(1) 
where and are the observed and predicted variables, respectively, reveals the priors of predictions, and is a measure of compatibility, i.e., fidelity, between and .
We establish a novel proximal framework to unroll the general energy in Eq. (1), and incorporate various experimentally efficient architectures into the resulting deep model. Promising theoretical properties and practical performance are also demonstrated. The main advantages of our proposed framework over existing optimization-unrolled deep models can be distilled into the following three points.
Insensitive Unrolling Scheme: Most existing iteration-unrolling based deep models are strictly confined to some special types of energy formulations. For example, the architectures in [Schmidt and Roth2014, Chen and Pock2017] are deduced from the field-of-experts prior, while the networks in [Gregor and LeCun2010] are based on regularizations. In contrast, our unrolling strategy only depends on the separable structure and functional properties of , but is completely insensitive to the particular forms of and . We can even design deep models without knowing the form of so that our framework adapts to various challenging tasks and complex data distributions.
Flexible Built-in Architecture: On the one hand, the architectures in existing unrolled networks are deduced from fixed iteration schemes and thus lack flexibility. On the other hand, many practical applications have shown that heuristic deep architectures, e.g., ReLU and Batch Normalization, are extremely efficient although they lack theoretical analysis. Our studies show that, under some mild conditions, our deep model can incorporate most existing empirically successful network architectures (even those built by means of engineering tricks). In other words, we theoretically offer flexibility for existing deep architectures while taking advantage of their efficiency.
Convergence Guarantee: A fundamental weakness underlying existing networks is the elusiveness of theoretical analysis. In particular, little to no attention has been paid to the convergence behaviors of deep models. (Notice that the concept of “convergence” in this paper is related not only to the propagation of network parameters in the training phase, but also to the outputs of network architectures in both the training and test phases. That is, we consider the output of the th basic architecture as the th element of a sequence, and then investigate the convergence of the resulting sequence.) The main reason is that, even when networks are built by unrolling convergent optimization algorithms, the dynamic nature of their parameters and heuristic architectures still breaks the convergence guarantee of the original iteration scheme. In contrast, this paper proves that our designed deep models do have nice convergence properties. That is, under some mild conditions, the sequence generated by the proposed proximal alternating direction networks (PADNet) converges to a critical point of Eq. (1) with relatively simple priors. (We will formally discuss the details of relatively “simple” and “complex” priors in the following sections.) Furthermore, our propagation guarantees at least fixed-point convergence when handling complex priors, e.g., with only partial task/data information. We argue that these theoretical results are the best we can ask for, unless stronger assumptions are enforced.
Proximal Alternating Direction Network
In this section, we develop an alternating direction type unrolling scheme to generate the propagation sequence (denoted as ) based on the energy model in Eq. (1). As shown in the following, rather than directly calculating from , we would like to first design cascaded propagations of two auxiliary variables (denoted as and ) corresponding to the fidelity and prior of the task, respectively. The residual type deep architectures are then incorporated for subsequence updating. Finally, a novel proximal error correction process is designed to control our propagation.
Alternating Direction Scheme: For each , we introduce the Moreau-Yosida regularization [Parikh, Boyd, and others2014, Xu, Lin, and Zha2016] of with parameter and auxiliary variable to obtain the following regularized energy model
(2) 
Now the problem is temporarily reduced to calculating based on . One common inference strategy for in Eq. (2) is to introduce another auxiliary variable and a Lagrange multiplier and then perform alternating minimizations of the corresponding augmented Lagrangian function, resulting in the following iteration scheme
(3)  
(4)  
(5) 
where is a penalty parameter.
In this way, we actually perform the well-known Alternating Direction Method of Multipliers (ADMM) [Parikh, Boyd, and others2014, Lin, Liu, and Su2011] for the Moreau-Yosida regularized approximation of the original energy in Eq. (1) at each iteration.
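To make the alternating pattern concrete, the following is a minimal sketch of scaled-form ADMM on a toy problem. It is not the exact scheme of Eqs. (3)-(5): the quadratic fidelity, the l1 prior, and the names `admm_l1_denoise`, `lam`, and `rho` are assumptions chosen only so that both subproblems have closed-form solutions.

```python
def soft_threshold(v, tau):
    """Closed-form minimizer of tau * |z| + 0.5 * (z - v)**2 for a scalar v."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def admm_l1_denoise(y, lam=0.5, rho=1.0, num_iters=200):
    """Scaled-form ADMM for min 0.5*||x - y||^2 + lam*||z||_1 subject to x = z.

    Toy problem (an assumption for illustration, not the paper's energy).
    """
    n = len(y)
    x = [0.0] * n
    z = [0.0] * n
    u = [0.0] * n  # scaled Lagrange multiplier
    for _ in range(num_iters):
        # x-update: quadratic (fidelity) subproblem, solved in closed form
        x = [(y[i] + rho * (z[i] - u[i])) / (1.0 + rho) for i in range(n)]
        # z-update: proximal step on the prior
        z = [soft_threshold(x[i] + u[i], lam / rho) for i in range(n)]
        # multiplier update: accumulate the constraint residual x - z
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return z
```

For this toy energy the minimizer is the soft-thresholding of `y`, so the output can be checked directly; in an unrolled network, one of the closed-form updates would be replaced by a learned module.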
Built-in Deep Architecture: We then show how to incorporate deep architectures into the above base iteration. Specifically, we consider a residual formulation to replace the subproblem in Eq. (4). (Formally, we should denote the output of the th residual unit at the th iteration as ; in this paragraph, we temporarily omit the superscript to simplify the presentation.) That is, we define the propagation of as
(6) 
where is the set of learnable parameters, is a step-size parameter, is the basic network unit, are the input and output of (at th stage), respectively. Notice that standard training strategies can be directly adopted to optimize the parameters of our basic architecture. If necessary, one may further jointly fine-tune the parameters of the whole network after the design phase.
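The residual propagation described here can be sketched as repeated updates of the form "input plus step size times the output of a basic unit". The code below is only an illustration under assumptions: `residual_propagate` and `make_quadratic_direction` are hypothetical names, and the "unit" is an exact negative gradient of a toy quadratic standing in for a trained network.

```python
def residual_propagate(u0, direction, step=0.5, num_units=20):
    """Apply u <- u + step * direction(u) for a fixed number of residual units."""
    u = list(u0)
    for _ in range(num_units):
        d = direction(u)
        # residual connection: identity path plus a scaled update
        u = [ui + step * di for ui, di in zip(u, d)]
    return u


def make_quadratic_direction(target):
    """Hypothetical stand-in for a trained unit: the exact negative gradient
    of the toy energy 0.5 * ||u - target||^2."""
    def direction(u):
        return [t - ui for ui, t in zip(u, target)]
    return direction
```

With `step=0.5`, each unit halves the distance to `target`, which shows in the simplest case why a unit that estimates a descent direction drives the propagation toward a minimizer of the subproblem.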
It is easy to check that the network in Eq. (6) actually recursively performs coordinate descent steps (i.e., ) to propagate . So from the optimization viewpoint, we interpret as a descent-direction-estimation architecture for the optimization of the subproblem in Eq. (4). In more challenging scenarios (e.g., when it is hard to define an explicit and solvable for this subproblem), we can still learn built-in propagation architectures from training data to obtain our desired solution.
Proximal Error Correction: We are now ready to give the formal updating scheme of . We can see that the built-in architectures in Eq. (6) do not exactly optimize the original energy in Eq. (1). So it is necessary to introduce an additional step to control our propagation at each iteration. Specifically, denote as the output of the built-in network in Eq. (6) at the th iteration. Then we adopt a proximal-gradient-like scheme [Wang et al.2017] to formally update
(7) 
where is Moreau’s proximal operator of .
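As a reminder of what a proximal operator computes, the sketch below implements two standard closed-form cases and a brute-force check of the definition. The convention prox(v) = argmin_x tau*g(x) + 0.5*(x - v)^2, the choice of priors (l1 and l0), and all function names are assumptions for illustration; they are not taken from Eq. (7).

```python
def prox_l1(v, tau):
    """Soft-thresholding: closed-form prox of tau * ||.||_1 (elementwise)."""
    out = []
    for vi in v:
        mag = max(abs(vi) - tau, 0.0)
        out.append(mag if vi >= 0 else -mag)
    return out


def prox_l0(v, tau):
    """Hard-thresholding: closed-form prox of tau * ||.||_0 (elementwise)."""
    thresh = (2.0 * tau) ** 0.5
    return [vi if abs(vi) > thresh else 0.0 for vi in v]


def prox_by_grid_search(g, v, tau, lo=-5.0, hi=5.0, num=20001):
    """Brute-force scalar prox: argmin_x tau * g(x) + 0.5 * (x - v)**2 on a grid."""
    best_x, best_val = lo, float("inf")
    for i in range(num):
        x = lo + (hi - lo) * i / (num - 1)
        val = tau * g(x) + 0.5 * (x - v) ** 2
        if val < best_val:
            best_x, best_val = x, val
    return best_x
```

The grid search is far too slow for real use; it is included only to confirm that the closed forms agree with the definition on scalar inputs.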
Overall, our proposed deep model, called Proximal Alternating Direction Network (PADNet), is summarized in Alg. 1. Notice that we actually consider PADNet in two different scenarios, which can be categorized by properties of prior regularization in Eq. (1). That is:

Simple priors: can be computed in closed form.

Complex priors: is intractable or is unknown.
We perform error correction (i.e., Step 4 in Alg. 1) in the first case (Explicit PADNet, or EPADNet for short) but directly propagate the output of the built-in networks (i.e., Step 3 in Alg. 1) in the second case (Implicit PADNet, or IPADNet for short). Theoretical results for these two scenarios are proved in the next section.
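The difference between the two modes is mainly one of control flow, which the following sketch tries to capture. It is deliberately simplified and rests on assumptions: the correction step here is a plain proximal map applied to the network output, which is not the full update of Eq. (7), and `padnet_like_stage`, `run_stages`, `network`, and `prox` are hypothetical names.

```python
def padnet_like_stage(x, network, prox=None):
    """One stage: built-in network, then an optional proximal correction."""
    u = network(x)
    if prox is None:
        # implicit mode: no tractable prior, propagate the network output
        return u
    # explicit mode: correct the network output with the prior's prox
    return prox(u)


def run_stages(x0, network, prox=None, num_stages=40):
    """Cascade a fixed number of stages starting from x0."""
    x = list(x0)
    for _ in range(num_stages):
        x = padnet_like_stage(x, network, prox)
    return x


def toy_network(x):
    """Hypothetical stand-in for a trained unit (a simple contraction)."""
    return [0.5 * xi + 1.0 for xi in x]


def toy_prox(u, tau=0.5):
    """Soft-thresholding, i.e. the prox of tau * ||.||_1."""
    return [max(abs(ui) - tau, 0.0) * (1.0 if ui >= 0 else -1.0) for ui in u]
```

In this toy setting both modes settle at a fixed point, but at different ones (2 without correction, 1 with it), which illustrates that the correction step changes what the cascade converges to.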
Learning with Convergence Guarantee
In general, unrolling task-aware optimization schemes may incorporate rich domain knowledge into the network structure. Unfortunately, the sequence generated by most existing unrolled deep models no longer has a convergence guarantee, even though nice theoretical results have been proved and verified for their original optimization schemes.
Fortunately, we demonstrate in this work that, under some mild conditions, the propagation generated by our PADNet is globally converged, even with built-in network architectures designed in heuristic manners. (Notice that “globally converged” in this paper means that the whole sequence generated by our deep model converges; this concept has been widely used in the nonconvex optimization community [Attouch et al.2010].)
Convergence Behavior Analysis of PADNet
To make our paper self-contained, some necessary definitions are presented before the formal analysis. These concepts are widely known in variational analysis and optimization; one may refer to [Rockafellar and Wets2009, Attouch et al.2010] and references therein for more details.
Definition 1.
We give necessary definitions, including proper and lower semicontinuous, coercive and semialgebraic.

A function is said to be proper and lower semicontinuous if , where and at any point .

A function is said to be coercive, if is bounded from below and if , where is the norm.

A subset of is a real semialgebraic set if there exist a finite number of real polynomial functions such that
(8) 
A function is called semialgebraic if its graph is a semialgebraic subset of .
Remark 1.
Indeed, many functions arising in learning and vision areas, including norm, rational norms (i.e., with positive integers and ) used in our experimental part and their finite sums or products, are all semialgebraic.
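For reference, the three notions in Definition 1 can be written in their standard textbook form. The block below is a sketch in assumed notation (a function $\sigma:\mathbb{R}^n\to(-\infty,+\infty]$ and real polynomials $g_{ij}$, $h_{ij}$), since the symbols used in the definition itself are not available here; it follows the usual statements in the variational-analysis literature, e.g., [Rockafellar and Wets2009, Attouch et al.2010].

```latex
% Proper and lower semicontinuous (assumed notation: \sigma, \bar{x})
\operatorname{dom}\sigma := \{x \in \mathbb{R}^n : \sigma(x) < +\infty\} \neq \emptyset,
\qquad
\liminf_{x \to \bar{x}} \sigma(x) \ge \sigma(\bar{x}) \quad \text{at any point } \bar{x}.

% Coercive
\inf_{x} \sigma(x) > -\infty
\quad \text{and} \quad
\sigma(x) \to +\infty \ \text{ as } \ \|x\| \to +\infty.

% Real semialgebraic set S \subseteq \mathbb{R}^n (finitely many polynomials g_{ij}, h_{ij})
S = \bigcup_{j=1}^{p} \bigcap_{i=1}^{q}
\bigl\{ x \in \mathbb{R}^n : g_{ij}(x) = 0, \ h_{ij}(x) < 0 \bigr\}.
```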
In the following, we first analyze PADNet for tasks with simple priors. Specifically, given a variable , we estimate the discrepancy between it and the optimal solution of Eq. (2) by the function
(9) 
Here is deduced based on the first-order optimality condition of Eq. (2) at the th iteration. Then with the following simple error condition, we prove in Theorem 1 that the propagation of EPADNet indeed globally converges to a critical point of Eq. (1). Please refer to the supplemental material for necessary preliminaries and all proofs of the theoretical results in this paper.
Condition 1.
(Error Condition) The error function (in Eq. (9)) at th iteration should satisfy , where is a universal constant.
Theorem 1.
(Critical-Point Convergence of Explicit PADNet) Let be continuously differentiable, be proper and lower semicontinuous and be coercive. Then EPADNet converges to a critical point of Eq. (1) under Condition 1. That is, generated by EPADNet is a bounded sequence and any of its cluster points is a critical point of Eq. (1) (i.e., satisfying ). Furthermore, if is semialgebraic, then globally converges to a critical point of Eq. (1).
Remark 2.
With the semialgebraic property of , we can also obtain the convergence rate of EPADNet based on a particular desingularizing function with a constant and parameter (defined in [Chouzenoux, Pesquet, and Repetti2016]). Specifically, the sequence converges after finitely many iterations if . Linear and sublinear rates are obtained when choosing and , respectively.
It has been verified that a broad class of functions arising in learning problems (even nonconvex and nonsmooth ones) satisfy the assumptions in Theorem 1. For example, both norm and rational norm with (i.e., , and are positive integers) are proper, lower semicontinuous and semialgebraic.
Based on the above analysis, we propose a learning framework (summarized in Alg. 2) to adaptively design and train globally converged deep models for different learning tasks.
Remark 3.
Theorem 1 together with Alg. 2 provides a flexible framework with a solid theoretical guarantee for deep model design: we only need to check whether the built-in networks satisfy Condition 1 during their design phase. Furthermore, in general, any architecture satisfying this condition (even one designed in an engineering manner) can be incorporated into our deep models.
In contrast, when handling tasks with complex priors, neither error checking (i.e., Step 6 in Alg. 2) during design and training nor error correction (i.e., Step 4 in Alg. 1) during testing is performed. Therefore, we cannot obtain the same convergence results as those in Theorem 1. Fortunately, by enforcing another easily satisfied condition on the built-in architectures, we can still prove a fixed-point convergence guarantee for IPADNet.
Condition 2.
(Architecture Condition) For any given input , the architecture should satisfy , where is a universal constant.
Notice that this bound condition is relatively weak and we can check that most commonly used linear and nonlinear operations in existing deep networks satisfy it.
Theorem 2.
(Fixed-Point Convergence of Implicit PADNet) Let be continuously differentiable with bounded gradients. Then IPADNet converges under Condition 2. That is, generated by IPADNet is a Cauchy sequence, so that is globally converged to a fixed point.
Remark 4.
Theorem 2 provides a theoretically guaranteed paradigm to fuse both analytical and empirical information to build deep models for challenging learning tasks. That is to say, we can simultaneously design a model-based fidelity function to reveal our theoretical understanding of the problem and learn complex priors from training data with a model-free network architecture .
To end our analysis, we emphasize that the above convergence results are the best we can ask for unless other stronger assumptions are made on the given learning task.
Implementable Error Calculation
It can be observed in Eq. (9) that directly calculating using its theoretical definition is challenging due to the subgradient term . So we provide a calculable formulation based on the following derivations. Specifically, using Eq. (7), we have
(10) 
By setting in Eq. (10) and following Theorem 1, we directly have that if , then
(11) 
Therefore, we actually obtain the following implementable error calculation formulation for
(12) 
Discussions
Intuitively, one may argue that building a deeper network should always result in good performance. Unfortunately, much empirical evidence [Simonyan and Zisserman2014] suggests that the improvement cannot be trivially gained by simply adding more layers; worse, deeper networks even suffer from a decline in performance in some applications [Shen, Lin, and Huang2016]. Therefore, it is particularly worthwhile to investigate the intrinsic propagation behaviors of networks with different topological structures and architectures from a more solid theoretical perspective.
Indeed, our theories above build intrinsic theoretical connections between unrolled deep models and the original numerical schemes. We also investigate conditions for incorporating heuristic architectures into the proposed deep model. Therefore, the studies in this paper should provide a new perspective and introduce several powerful tools from the optimization area to address the challenging but fundamental issues discussed in the above paragraph.
Experiments
To verify our theoretical results and demonstrate the effectiveness of our deep models in application fields, we apply PADNet to two real-world applications, i.e., non-blind deconvolution and single image haze removal. All experiments are conducted on a PC with an Intel Core i7 CPU at 3.4 GHz, 32 GB RAM and an NVIDIA GeForce GTX 1050 Ti GPU.
Non-blind Deconvolution
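| Ave. | Number of Iterations (denoted as ) |  |  |  |  |  |  |  |
| Alg. | ADMM | HQS | IPADNet | EPADNet | ADMM | HQS | IPADNet | EPADNet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 29 | 33 | 6 | 5 | 1.3146 | 1.3354 | 0.4291 | 0.4516 |
|  | 28 | 32 | 6 | 5 | 1.2069 | 1.2215 | 0.4291 | 0.4140 |
|  | 40 | 54 | 6 | 6 | 1.7114 | 1.7598 | 0.4291 | 0.6005 |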
We first consider non-blind deconvolution, which is an important task in learning and vision areas. Specifically, given an observation (e.g., image), the latent signal can be processed in a filtered domain as follows [Krishnan and Fergus2009, Schmidt and Roth2014] , where is a set of filters (e.g., horizontal and vertical gradient operations), denotes convolution, is a point spread function and denotes errors/noises. This problem can be formulated as the maximum-a-posteriori estimation
(13) 
Here we follow typical choices to consider fidelity (i.e., ) and regularization (i.e., , ), where is the parameter. We adopt the results in [Zuo et al.2013] to calculate the proximal operation of general minimization. In the following deconvolution experiments, we always use the set of images of size built in [Chen and Pock2017] as our training data. Two commonly used image deblurring benchmarks, collected by Levin et al. [Levin et al.2009] (32 blurry images of size ) and Sun et al. [Sun et al.2013] (640 blurry images with Gaussian noises, with sizes ranging from to ), are used for testing.
Convergence Behaviors on Gradient Domain
The gradient of images plays a very important role in image structure analysis. Here we first consider deconvolution in the gradient domain to verify the convergence behaviors of our designed deep models for a given energy with a simple prior. Specifically, the energy in the gradient domain is defined as
(14) 
where denotes the gradient of the latent image. We first build the basic architecture as a cascade of two convolutions with one RBF nonlinearity [Schmidt and Roth2014] between them. Then we perform Alg. 2 based on Eq. (14) with to design three EPADNet models, respectively. We also establish an IPADNet model from Alg. 2 with only the fidelity . To compare iteration behaviors with conventional optimization strategies, we also run the popular ADMM and Half-Quadratic Splitting (HQS) [Zuo et al.2013] algorithms on Eq. (14) with the same regularizer and parameters.
The averaged convergence results of the compared algorithms on the benchmark of Levin et al. are reported in Tab. 1. As IPADNet does not depend on functions, we simply repeat its results for the three cases (i.e., ) in this table.
It can be seen that our designed deep models (i.e., one IPADNet and three EPADNets) need far fewer iterations yet obtain more accurate estimations than conventional optimization schemes. Moreover, the performance of IPADNet is better than that of the EPADNets regularized by and , but a little worse than that of the energy. These results make sense because the prior learned from training data should perform better than relatively improper handcrafted priors (e.g., and norms in this task). If the prior function fits the data distribution well (e.g., norm here), the critical-point convergence guarantee of EPADNet results in better performance than the relatively weak fixed-point convergence of IPADNet.
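| Dataset | Metric | TV | HL | Ours (E) | EPLL | IDDBM3D | MLP | CSF | Ours (I) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Levin | PSNR | 29.38 | 30.12 | 33.41 | 31.65 | 31.53 | 31.32 | 32.74 | 33.37 |
| Levin | SSIM | 0.88 | 0.90 | 0.95 | 0.93 | 0.90 | 0.90 | 0.93 | 0.95 |
| Sun | PSNR | 30.67 | 31.03 | 32.69 | 32.44 | 30.79 | 31.47 | 31.55 | 32.71 |
| Sun | SSIM | 0.85 | 0.85 | 0.89 | 0.88 | 0.87 | 0.86 | 0.87 | 0.89 |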
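| Alg. | He et al. | Meng et al. | Chen et al. | Berman et al. | Li et al. | Ren et al. | Cai et al. | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PSNR | 27.11 | 26.13 | 26.47 | 26.09 | 25.54 | 24.40 | 21.63 | 28.47 |
| SSIM | 0.96 | 0.95 | 0.93 | 0.95 | 0.94 | 0.94 | 0.89 | 0.96 |
| Time (s) | 17.20 | 6.95 | 272.07 | 3.73 | 62.67 | 5.87 | 5.77 | 2.74 |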
We also plot curves of relative errors (i.e., iteration error and reconstruction error ) and the error condition (referring to and ) on an example image from this benchmark in Fig. 1, where denotes the ground-truth image gradient. To provide more readable illustrations of convergence behaviors, all relative errors are plotted starting from . We also show zoomed-in curve comparisons of our two deep models in Fig. 1 (b). Notice that we have only one implicit deep model for this task; to compare its performance with methods based on different energies, we repeatedly plot its relative errors (as green curves) in multiple subfigures.
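The two kinds of relative error mentioned here can be computed along any sequence of iterates. The sketch below shows one common way to do so; the exact normalization is an assumption (the difference is divided by the norm of the current iterate or of the ground truth, respectively), and the function names are hypothetical.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(vi * vi for vi in v))


def relative_iteration_error(x_next, x_curr, eps=1e-12):
    """||x_next - x_curr|| / ||x_curr||, guarded against division by zero."""
    diff = [a - b for a, b in zip(x_next, x_curr)]
    return l2_norm(diff) / max(l2_norm(x_curr), eps)


def relative_reconstruction_error(x_curr, x_true, eps=1e-12):
    """||x_curr - x_true|| / ||x_true||, guarded against division by zero."""
    diff = [a - b for a, b in zip(x_curr, x_true)]
    return l2_norm(diff) / max(l2_norm(x_true), eps)
```

A vanishing iteration error only indicates that the sequence has stopped moving, whereas the reconstruction error measures distance to the ground truth, so the two curves answer different questions.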
It is observed in Fig. 1 (a) that our deep models always converged within  iterations, while both ADMM and HQS needed dozens of steps to stop their iterations. The dash-dot curves in Fig. 1 (a) show that the designed EPADNet satisfied the error constraint in Condition 1 all the time, so the global convergence to a critical point of Eq. (14) is experimentally supported. All these results verify our theoretical results. We can further see in Fig. 1 (b) that the propagations of our two deep models (solid red and green curves) obtained significantly lower reconstruction errors than conventional algorithms even after just the first iteration (i.e., the initial points of these curves). This is because the built-in networks learned a direct descent direction toward the desired solutions, which again demonstrates the superiority of our framework.
Explicit / Implicit PADNet on Image Domain
Non-blind deconvolution in the image domain is commonly formulated as the following energy minimization task [Li et al.2013, Krishnan and Fergus2009, Schmidt and Roth2014]
(15) 
in which the fidelity can be formulated as
(16) 
Here is a penalty parameter, and are variables in image and gradient domains, respectively.
In this part, we build an explicit PADNet using Eq. (15) to pursue , in which we set and introduce an additional linear layer derived by to transfer variables from the gradient domain to the image domain. In contrast, by simply defining and discarding explicit priors, we can also design an implicit PADNet to learn priors from training data for this task. Here the basic architecture (used in our deep models) consists of convolution layers. ReLU nonlinearities are added between every two linear layers, and batch normalization (BN) [Ioffe and Szegedy2015] is also introduced for the convolution operations from the nd to the th linear layers.
We compare the performance of our two deep models against state-of-the-art algorithms, including TV [Li et al.2013], HL [Krishnan and Fergus2009], EPLL [Zoran and Weiss2011], IDDBM3D [Danielyan, Katkovnik, and Egiazarian2012], MLP [Schuler et al.2013] and CSF [Schmidt and Roth2014], on both the standard benchmark of Levin et al. and the more challenging benchmark of Sun et al. The averaged quantitative results (i.e., PSNR and SSIM) are reported in Tab. 2, in which “(E)” and “(I)” denote algorithms based on explicit and implicit PADNet, respectively. Note that “Ours (E)” and the works in [Li et al.2013, Krishnan and Fergus2009] all address this task by optimizing Eq. (15). Thanks to the built-in networks, we achieve much better performance than conventional optimization approaches. We further observe that discriminative learning approaches [Schmidt and Roth2014, Schuler et al.2013] also perform well, as they learn adaptive networks from training data. Overall, the results of our two algorithms are better than those of the other compared approaches. The PSNR score of “Ours (E)” is even higher than that of “Ours (I)” on the standard dataset of Levin et al. We argue that this is reasonable because prior is already powerful enough for relatively simple test images. In contrast, “Ours (I)” obtained the best quantitative results on the dataset of Sun et al., which demonstrates that our prior-and-data aggregated framework is especially effective on challenging real-world applications (see Remark 4).
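For readers unfamiliar with the PSNR metric reported in these comparisons, a minimal sketch of its standard definition is given below. It assumes images flattened to lists of floats with intensities in [0, `peak`]; the function name and the default `peak=1.0` are assumptions, and the evaluation protocol actually used for the tables (e.g., boundary cropping or color handling) is not specified here.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # mean squared error over all pixels
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return float("inf")  # identical signals
    return 10.0 * math.log10(peak * peak / mse)
```

Because PSNR is logarithmic in the mean squared error, a gain of about 3 dB corresponds to roughly halving the error energy.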
Single Image Haze Removal
Finally, we evaluate PADNet on the task of single image haze removal, which is a challenging real-world vision application. Most existing works address this task by estimating the latent scene radiance from a given hazy observation using the following linear interpolation formula
(17) 
where is transmission, is global atmospheric light and denotes the pixel index.
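The linear interpolation model referred to here is commonly written per pixel as hazy = t * radiance + (1 - t) * A. The sketch below uses that standard form on flattened single-channel data; the function names, the scalar atmospheric light `airlight`, and the lower bound `t_min` on the transmission (a common numerical safeguard) are assumptions for illustration rather than details taken from Eq. (17).

```python
def synthesize_haze(radiance, transmission, airlight):
    """Forward model: blend scene radiance with atmospheric light per pixel."""
    return [t * j + (1.0 - t) * airlight for j, t in zip(radiance, transmission)]


def recover_radiance(hazy, transmission, airlight, t_min=0.1):
    """Invert the forward model given an estimated transmission map.

    The transmission is clamped from below by t_min (an assumed safeguard)
    to avoid amplifying noise where almost no scene light survives.
    """
    return [
        (i - airlight) / max(t, t_min) + airlight
        for i, t in zip(hazy, transmission)
    ]
```

Once the transmission map and atmospheric light are available, recovery is a closed-form inversion, which is why the accuracy of the estimated transmission dominates the final result.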
It is known that transmission expresses the relative portion of light that managed to survive the entire path between the observer and a surface point in the scene without being scattered [Fattal2014]. From Eq. (17), estimating an accurate transmission map plays the core role in this task. However, because multiple solutions exist for a single hazy image, the problem is highly ill-posed. Recent works often design their models based on different perspectives on transmission within the following prior-regularized energy
(18) 
where and respectively denote the discrete transmission vector and its propagation guidance. The regularization can be derived based on different tools, e.g., Total Generalized Variation (TGV) [Chen, Do, and Wang2016] and Markov Random Field (MRF) [Meng et al.2013]. In , denotes an auxiliary variable, is the Hadamard product and are weight vectors for local filters .
In this part, we first utilize the implicit strategy to design PADNet based on the fidelity (with the same guidance defined in [Meng et al.2013]) to estimate the transmission and then recover the latent scene radiance from Eq. (17) as in [He, Sun, and Tang2011, Chen, Do, and Wang2016, Meng et al.2013]. We build the basic architecture with convolution layers (ReLU and BN operations are incorporated using the same strategy as in the above image deconvolution task) and train it on synthetic hazy images [Ren et al.2016] for our deep model.
We evaluate the performance of our deep model together with five existing handcrafted-prior based algorithms (i.e., [He, Sun, and Tang2011, Meng et al.2013, Chen, Do, and Wang2016, Berman, Avidan, and others2016, Li et al.2014]) and two empirically designed deep networks (i.e., [Ren et al.2016, Cai et al.2016]) on the commonly used Fattal’s benchmark [Fattal2008], which consists of challenging hazy images, including architecture, natural scenery and indoor scenes. (In this subsection, we always denote these methods as He et al., Meng et al., Chen et al., Berman et al., Li et al., Ren et al. and Cai et al., respectively.) The averaged quantitative results, including PSNR, SSIM and running time in seconds (denoted as “Time (s)”), are given in Tab. 3. Two empirically designed networks in [Cai et al.2016, Ren et al.2016] performed better than most conventional prior-based methods. Although it obtains good dehazing results, the work in [Chen, Do, and Wang2016] has the longest running time. Our proposed deep model achieved the best performance among all compared algorithms on this benchmark. This is mainly because PADNet can successfully fuse cues from both human perspectives and training data to estimate haze distributions. Furthermore, PADNet is the fastest among all compared methods, which also verifies the efficiency of our framework.
Conclusions
This paper proposed a novel framework, named proximal alternating direction network (PADNet), to design deep models for different learning tasks. Our theoretical results first showed that we can utilize empirically designed architectures to build a globally converged PADNet for a given energy minimization model. We further proved that a converged PADNet can also be designed by learning priors from training data. Finally, we experimentally verified our analysis and demonstrated promising results of PADNet on different real-world applications.
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China (Nos. 61672125, 61632019, 61432003, 61572096 and 61733002), and the Hong Kong Scholar Program (No. XJ2015008). Dr. Liu is also a visiting researcher with Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen 518060.
Supplemental Material
Preliminaries
The following definitions and lemmas have been widely known in variational analysis and optimization. More details can also be found in [Rockafellar and Wets2009, Attouch et al.2010] and references therein.
Definition 2.
Here we give necessary definitions, including proper and lower semicontinuous, coercive and semialgebraic.

A function is said to be proper and lower semicontinuous if , where and at any point .

A function is said to be coercive, if is bounded from below and if , where is the norm.

A subset of is a real semialgebraic set if there exist a finite number of real polynomial functions such that
(19) 
A function is called semialgebraic if its graph is a semialgebraic subset of .
Indeed, many functions arising in learning areas, including norm, rational norms (i.e., with positive integers and ) and their finite sums or products, are all semialgebraic.
Lemma 1.
Here we list some necessary properties that will be used in our following proofs.

In the nonsmooth context, Fermat’s rule remains unchanged. That is, if is a local minimizer of , then .

If is a continuously differentiable function, then .

Let be proper and lower semicontinuous and be a sequence in the graph of . If and as , then is also in the graph of .
Finally, we recall that a critical point of a function is a point in the domain of , whose subdifferential contains .
Proof of Theorem 1
Proof.
The error condition of in (9) can be equivalently reformulated as
(20) 
where . Therefore, the propagation behavior of Alg. 2 actually should be understood as exactly solving Eq. (20) to output at th iteration. Thus
(21) 
in which can be any positive constant but the last equality only holds with . So we have . As is coercive, we also have that is bounded. Furthermore, as is nonincreasing, it converges to a constant (denoted as ). So summing (21) from to leads to
(22) 
Using , we have if .
Let be any cluster point of , i.e., if . Since is lower semicontinuous, we have
(25) 
From (20), we have
(26) 
Setting in (26) and letting , we obtain
(27) 
in which we used the facts that is continuous and bounded. So we have that tends to as . This together with the continuity of directly yields
(28) 
Using (24), (28) and Lemma 1, we have that . So any cluster point of is a critical point of Eq. (1).
If we further assume that is a semialgebraic function, then it satisfies the well-known Kurdyka-Łojasiewicz property [Rockafellar and Wets2009, Attouch et al.2010]. So we have that following [Attouch et al.2010]. This implies that is a Cauchy sequence and hence a convergent sequence. Since we have proved above that if , we finally have if , which completes the proof. ∎
Proof of Theorem 2
Proof.
The updating scheme of in Alg. 2 implies
(29) 
Moreover, it is easy to check that is bounded, i.e., there exists such that . So at the th iteration, we have
(30) 
Next, define . Then following Steps 5 and 8 in Alg. 2 and the bounded constraint on , we have
(31) 
Using Eq. (30) and (31), we can show
(32) 
As for , we have
(33) 
Following Step 11 in Alg. 2, we have
(34) 
So is a Cauchy sequence and hence there exists a fixed point such that if , which completes the proof. ∎
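As a generic illustration of how a Cauchy conclusion follows from a bound on successive differences, the sketch below iterates a scalar contraction and checks that the differences decay geometrically, so that their tail sums bound the distance between any two late iterates. The map `0.5 * cos(x)`, the factor `rho`, and the starting point are assumptions for this toy example and are unrelated to the specific bound derived in the proof.

```python
import math

def iterate(x0, steps=60):
    """Iterate the toy contraction T(x) = 0.5*cos(x), with |T'(x)| <= 0.5."""
    xs = [x0]
    for _ in range(steps):
        xs.append(0.5 * math.cos(xs[-1]))
    return xs

xs = iterate(2.0)
diffs = [abs(b - a) for a, b in zip(xs, xs[1:])]
rho = 0.5

# Successive differences are dominated by a geometric (hence summable) sequence.
geometric_ok = all(d <= rho ** k * diffs[0] + 1e-15 for k, d in enumerate(diffs))

# Tail sum of the geometric bound controls |x_m - x_10| for every m >= 10.
tail_bound = diffs[0] * rho ** 10 / (1 - rho)

print(geometric_ok)
print(abs(xs[-1] - xs[10]) <= tail_bound)
print(abs(xs[-1] - 0.5 * math.cos(xs[-1])))  # residual of the fixed-point equation
```

The same reasoning applies in any complete normed space: summable successive differences make the sequence Cauchy, and completeness supplies the limit.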
References
 [Andrychowicz et al.2016] Andrychowicz, M.; Denil, M.; Gomez, S.; Hoffman, M. W.; Pfau, D.; Schaul, T.; and de Freitas, N. 2016. Learning to learn by gradient descent by gradient descent. In NIPS, 3981–3989.
 [Attouch et al.2010] Attouch, H.; Bolte, J.; Redont, P.; and Soubeyran, A. 2010. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research 35(2):438–457.
 [Berman, Avidan, and others2016] Berman, D.; Avidan, S.; et al. 2016. Non-local image dehazing. In CVPR, 1674–1682.
 [Cai et al.2016] Cai, B.; Xu, X.; Jia, K.; Qing, C.; and Tao, D. 2016. DehazeNet: An end-to-end system for single image haze removal. IEEE TIP 25(11):5187–5198.
 [Chen and Pock2017] Chen, Y., and Pock, T. 2017. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE TPAMI 39(6):1256–1272.
 [Chen, Do, and Wang2016] Chen, C.; Do, M. N.; and Wang, J. 2016. Robust image and video dehazing with visual artifact suppression via gradient residual minimization. In ECCV, 576–591.
 [Chouzenoux, Pesquet, and Repetti2016] Chouzenoux, E.; Pesquet, J.-C.; and Repetti, A. 2016. A block coordinate variable metric forward–backward algorithm. Journal of Global Optimization 66(3):457–485.
 [Danielyan, Katkovnik, and Egiazarian2012] Danielyan, A.; Katkovnik, V.; and Egiazarian, K. 2012. BM3D frames and variational image deblurring. IEEE TIP 21(4):1715–1728.
 [Fattal2008] Fattal, R. 2008. Single image dehazing. ACM Transactions on Graphics (TOG) 27(3):72.
 [Fattal2014] Fattal, R. 2014. Dehazing using color-lines. ACM Transactions on Graphics (TOG) 34(1):13.
 [Gregor and LeCun2010] Gregor, K., and LeCun, Y. 2010. Learning fast approximations of sparse coding. In ICML, 399–406.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
 [He, Sun, and Tang2011] He, K.; Sun, J.; and Tang, X. 2011. Single image haze removal using dark channel prior. IEEE TPAMI 33(12):2341–2353.
 [Ioffe and Szegedy2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 448–456.
 [Krishnan and Fergus2009] Krishnan, D., and Fergus, R. 2009. Fast image deconvolution using hyper-Laplacian priors. In NIPS, 1033–1041.
 [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 1097–1105.
 [Levin et al.2009] Levin, A.; Weiss, Y.; Durand, F.; and Freeman, W. T. 2009. Understanding and evaluating blind deconvolution algorithms. In CVPR, 1964–1971.
 [Li et al.2013] Li, C.; Yin, W.; Jiang, H.; and Zhang, Y. 2013. An efficient augmented Lagrangian method with applications to total variation minimization. Computational Optimization and Applications 56(3):507–530.
 [Li et al.2014] Li, Y.; Guo, F.; Tan, R. T.; and Brown, M. S. 2014. A contrast enhancement framework with JPEG artifacts suppression. In ECCV, 174–188.
 [Lin, Liu, and Su2011] Lin, Z.; Liu, R.; and Su, Z. 2011. Linearized alternating direction method with adaptive penalty for low-rank representation. In NIPS, 612–620.
 [Meng et al.2013] Meng, G.; Wang, Y.; Duan, J.; Xiang, S.; and Pan, C. 2013. Efficient image dehazing with boundary constraint and contextual regularization. In ICCV, 617–624.
 [Parikh, Boyd, and others2014] Parikh, N.; Boyd, S.; et al. 2014. Proximal algorithms. Foundations and Trends® in Optimization 1(3):127–239.
 [Ren et al.2016] Ren, W.; Liu, S.; Zhang, H.; Pan, J.; Cao, X.; and Yang, M.-H. 2016. Single image dehazing via multi-scale convolutional neural networks. In ECCV, 154–169.
 [Rockafellar and Wets2009] Rockafellar, R. T., and Wets, R. J.-B. 2009. Variational analysis, volume 317. Springer Science & Business Media.
 [Schmidt and Roth2014] Schmidt, U., and Roth, S. 2014. Shrinkage fields for effective image restoration. In CVPR, 2774–2781.

 [Schuler et al.2013] Schuler, C. J.; Burger, H. C.; Harmeling, S.; and Schölkopf, B. 2013. A machine learning approach for non-blind image deconvolution. In CVPR, 1067–1074.
 [Shen, Lin, and Huang2016] Shen, L.; Lin, Z.; and Huang, Q. 2016. Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, 467–482. Springer.
 [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
 [Sun et al.2013] Sun, L.; Cho, S.; Wang, J.; and Hays, J. 2013. Edge-based blur kernel estimation using patch priors. In ICCP.
 [Teh et al.2003] Teh, Y. W.; Welling, M.; Osindero, S.; and Hinton, G. E. 2003. Energy-based models for sparse overcomplete representations. JMLR 4(Dec):1235–1260.
 [Wang et al.2017] Wang, Y.; Liu, R.; Song, X.; and Su, Z. 2017. An inexact proximal alternating direction method for nonconvex and nonsmooth matrix factorization and beyond. arXiv preprint arXiv:1702.08627.
 [Xu, Lin, and Zha2016] Xu, C.; Lin, Z.; and Zha, H. 2016. Relaxed majorization-minimization for nonsmooth and nonconvex optimization. In AAAI, 812–818.
 [Zhao, Mathieu, and LeCun2016] Zhao, J.; Mathieu, M.; and LeCun, Y. 2016. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126.
 [Zoran and Weiss2011] Zoran, D., and Weiss, Y. 2011. From learning models of natural image patches to whole image restoration. In ICCV, 479–486.
 [Zuo et al.2013] Zuo, W.; Meng, D.; Zhang, L.; Feng, X.; and Zhang, D. 2013. A generalized iterated shrinkage algorithm for nonconvex sparse coding. In ICCV, 217–224.