# Proximal Alternating Direction Network: A Globally Converged Deep Unrolling Framework

Deep learning models have achieved great success in many real-world applications. However, most existing networks are designed heuristically and thus lack rigorous mathematical principles and derivations. Several recent studies build deep structures by unrolling a particular optimization model that involves task information. Unfortunately, due to the dynamic nature of network parameters, the resulting deep propagation networks do not possess the convergence properties of the original optimization scheme. This paper provides a novel proximal unrolling framework to establish deep models by integrating experimentally verified network architectures and rich cues of the tasks. More importantly, we prove in theory that 1) the propagation generated by our unrolled deep model globally converges to a critical point of a given variational energy, and 2) the proposed framework is still able to learn priors from training data to generate a convergent propagation even when task information is only partially available. Indeed, these theoretical results are the best we can ask for, unless stronger assumptions are enforced. Extensive experiments on various real-world applications verify the theoretical convergence and demonstrate the effectiveness of the designed deep models.


## Introduction

In recent years, deep models (a.k.a. deep neural networks) have produced state-of-the-art performance in many application fields, such as image processing, object recognition, natural language processing, and bioinformatics. On the downside, these existing approaches are typically designed based on heuristic understandings of a particular problem, and trained using engineering experience to implement multi-layered feature propagations, e.g., [Krizhevsky, Sutskever, and Hinton2012, Simonyan and Zisserman2014, He et al.2016]. Therefore, they lack solid theoretical guidance and interpretation. More importantly, it is challenging to incorporate the mathematical rules and principles of the considered task into these existing networks.

Alternatively, several recent works, e.g., [Gregor and LeCun2010, Schmidt and Roth2014, Andrychowicz et al.2016], build their networks using a specific optimization model and iteration scheme. The main idea is to unroll numerical algorithms and constitute the network architectures based on the resulting iterations. In this way, these approaches successfully incorporate the information of a predefined energy into the network propagation. Nevertheless, due to the dynamic nature of parameterized iterations, existing theoretical results from the optimization area, especially convergence, are no longer valid. Furthermore, the unrolled deep models often have limited flexibility and adaptability, as the basic architectures are restricted by the particular iteration scheme.

To partially overcome the limitations of existing approaches, this work attempts to develop a simple, flexible and efficient framework to build deep models for various real-world applications. The starting point of our work is Energy-Based Models (EBMs) [Teh et al.2003, Zhao, Mathieu, and LeCun2016]. EBMs are a family of methods that associate a scalar energy with each configuration of observations and the predictions of interest. Inference in an EBM consists of searching for a configuration of variables that minimizes the energy. In this work, we consider the following energy minimization formulation

$$\inf_{x} \left\{ F(x) := f(x; y) + r(x) \right\}, \tag{1}$$

where $y$ and $x$ are the observed and predicted variables, respectively, $r$ reveals the priors of predictions, and $f$ is a measure of compatibility, i.e., fidelity, between $x$ and $y$.

We establish a novel proximal framework to unroll the general energy in Eq. (1), and incorporate various experimentally efficient architectures into the resulting deep model. Promising theoretical properties and practical performance will also be demonstrated. The main advantages of our proposed framework over existing optimization-unrolled deep models can be distilled into the following three points.

Insensitive Unrolling Scheme: Most existing iteration-unrolling based deep models are strictly confined to some special types of energy formulations. For example, architectures in [Schmidt and Roth2014, Chen and Pock2017] are deduced from the field-of-experts prior while networks in [Gregor and LeCun2010] are based on $\ell_1$-regularizations. In contrast, our unrolling strategy only depends on the separable structure and functional properties of $F$, but is completely insensitive to the particular forms of $f$ and $r$. We can even design deep models without knowing the form of $r$, so that our framework adapts to various challenging tasks and complex data distributions.

Flexible Built-in Architecture: On the one hand, the architectures in existing unrolled networks are deduced from fixed iteration schemes, thus lacking flexibility. On the other hand, it has been revealed in many practical applications that heuristic deep architectures, e.g., ReLU and Batch Normalization, are extremely efficient though they lack theoretical analysis. Our studies show that under some mild conditions, our deep model can incorporate most existing empirically successful network architectures (even those built by means of engineering tricks). In other words, we theoretically offer flexibility for existing deep architectures while taking advantage of their efficiency.

Convergence Guarantee: A fundamental weakness underlying existing networks is the elusiveness of theoretical analysis. In particular, little to no attention has been paid to the convergence behaviors of deep models. (Notice that the concept of “convergence” in this paper is related not only to the propagation of network parameters in the training phase, but also to the outputs of network architectures in both training and test phases. That is, we consider the output of the $k$-th basic architecture as the $k$-th element of a sequence, and then investigate the convergence of the resulting sequence.) The main reason is that even when networks are built by unrolling convergent optimization algorithms, the dynamic nature of their parameters and their heuristic architectures still break the convergence guarantee of the original iteration scheme. In contrast, this paper proves that our designed deep models do have nice convergence properties. That is, under some mild conditions, the sequence generated by the proposed proximal alternating direction networks (PADNet) converges to a critical point of Eq. (1) with relatively simple priors (we formally discuss relatively “simple” and “complex” priors in the following sections). Furthermore, our propagation guarantees at least fixed-point convergence when handling complex priors, e.g., with only partial task/data information. We argue that these theoretical results are the best we can ask for, unless stronger assumptions are enforced.

## Proximal Alternating Direction Network

In this section, we develop an alternating direction type unrolling scheme to generate the propagation sequence (denoted as $\{x^k\}$) based on the energy model in Eq. (1). As shown in the following, rather than directly calculating $x^{k+1}$ from $x^k$, we first design cascaded propagations of two auxiliary variables (denoted as $u$ and $v$) corresponding to the fidelity and prior of the task, respectively. Residual-type deep architectures are then incorporated for the subsequent updating. Finally, a novel proximal error correction process is designed to control our propagation.

Alternating Direction Scheme: For each $x^k$, we introduce the Moreau-Yosida regularization [Parikh, Boyd, and others2014, Xu, Lin, and Zha2016] of $F$ with parameter $\mu^k$ and auxiliary variable $u$ to obtain the following regularized energy model

$$\mathcal{M}_{\mu^k}^{F}(x^k) = \inf_{u} \left\{ f_{\mu^k}^{x^k}(u) + r(u) \right\}, \quad \text{where } f_{\mu^k}^{x^k}(u) := f(u) + \frac{\mu^k}{2}\|u - x^k\|^2. \tag{2}$$
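To make the Moreau-Yosida regularization in Eq. (2) concrete, the following minimal sketch evaluates it for a toy scalar energy. It assumes $f \equiv 0$ and $r(u) = |u|$, choices made purely for illustration and not taken from the paper; under that assumption the infimum has the well-known Huber closed form. The function names are hypothetical.

```python
import numpy as np


def moreau_envelope_abs(x, mu, grid=None):
    """Numerically evaluate inf_u { |u| + (mu / 2) * (u - x)**2 } on a dense grid.

    This is Eq. (2) for the toy choice f = 0, r = |.| (an assumption of this sketch).
    """
    if grid is None:
        grid = np.linspace(-10.0, 10.0, 200001)
    values = np.abs(grid) + 0.5 * mu * (grid - x) ** 2
    return float(values.min())


def huber_closed_form(x, mu):
    """Closed form of the same infimum: quadratic near zero, shifted |x| elsewhere."""
    if abs(x) <= 1.0 / mu:
        return 0.5 * mu * x ** 2
    return abs(x) - 0.5 / mu


if __name__ == "__main__":
    for x in (-3.0, -0.2, 0.0, 0.4, 2.5):
        print(x, moreau_envelope_abs(x, mu=2.0), huber_closed_form(x, mu=2.0))
```

The regularized energy is a smoothed lower bound of the original one, which is what makes the inner problem at each $x^k$ well conditioned.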

Now the problem is temporarily reduced to calculating $x^{k+1}$ based on $x^k$. One common inference strategy for $u$ in Eq. (2) is to introduce another auxiliary variable $v$ and a Lagrange multiplier $\lambda$, and then perform alternating minimizations on the corresponding augmented Lagrangian function, resulting in the following iteration scheme

$$u^{k+1} = \arg\min_{u} f_{\mu^k}^{x^k}(u) + \frac{\rho^k}{2}\|u - (v^k - \lambda^k)\|^2, \tag{3}$$

$$v^{k+1} \in \arg\min_{v} r(v) + \frac{\rho^k}{2}\|v - (u^{k+1} + \lambda^k)\|^2, \tag{4}$$

$$\lambda^{k+1} = \lambda^k + (u^{k+1} - v^{k+1}), \tag{5}$$

where $\rho^k$ is a penalty parameter.

In this way, we actually perform the well-known Alternating Direction Method of Multipliers (ADMM) [Parikh, Boyd, and others2014, Lin, Liu, and Su2011] for the Moreau-Yosida regularized approximation of the original energy in Eq. (1) at each iteration.
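As an illustration of the base iteration in Eqs. (3)-(5), here is a minimal NumPy sketch. It assumes a quadratic fidelity $f(u) = \frac{1}{2}\|Au - y\|^2$ and an $\ell_1$ prior $r(v) = \lambda_r\|v\|_1$, so that both subproblems have closed-form solutions; these choices, the fixed penalty, and all function names are assumptions of the sketch rather than the paper's setup.

```python
import numpy as np


def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def regularized_energy(z, A, y, x_k, lam, mu):
    """f(z) + (mu / 2) * ||z - x_k||^2 + lam * ||z||_1, i.e. the objective inside Eq. (2)."""
    return (0.5 * np.sum((A @ z - y) ** 2)
            + 0.5 * mu * np.sum((z - x_k) ** 2)
            + lam * np.sum(np.abs(z)))


def admm_moreau_step(A, y, x_k, lam=0.1, mu=1.0, rho=1.0, n_iter=200):
    """Run the alternating scheme of Eqs. (3)-(5) around the current point x_k."""
    n = A.shape[1]
    u, v, dual = x_k.copy(), x_k.copy(), np.zeros(n)
    # Normal-equation matrix of the u-subproblem; constant because mu and rho are fixed here.
    H = A.T @ A + (mu + rho) * np.eye(n)
    Aty = A.T @ y
    for _ in range(n_iter):
        # Eq. (3): quadratic subproblem, solved exactly.
        u = np.linalg.solve(H, Aty + mu * x_k + rho * (v - dual))
        # Eq. (4): proximal step on the prior (soft-thresholding for the l1 norm).
        v = soft_threshold(u + dual, lam / rho)
        # Eq. (5): multiplier update.
        dual = dual + (u - v)
    return u, v, dual


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 10)) / np.sqrt(20)
    y = rng.standard_normal(20)
    x_k = rng.standard_normal(10)
    u, v, _ = admm_moreau_step(A, y, x_k)
    print(np.linalg.norm(u - v))
    print(regularized_energy(v, A, y, x_k, 0.1, 1.0), regularized_energy(x_k, A, y, x_k, 0.1, 1.0))
```

In PADNet the exact $v$-update is the step that gets replaced by a learned architecture; the sketch keeps it exact only to show the scheme being unrolled.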

Built-in Deep Architecture: We then show how to incorporate deep architectures into the above base iteration. Specifically, we consider a residual formulation to replace the subproblem in Eq. (4). That is, we define the propagation of $v$ as follows. (Formally, the output of each residual unit should also carry the iteration index as a superscript; in this paragraph, we temporarily omit the superscript to simplify the presentation.)

 (6)

where is the set of learnable parameters, is a step-size parameter, is the basic network unit, are the input and output of (at -th stage), respectively. Notice that standard training strategies can be directly adopted to optimize parameters of our basic architecture. If necessary, one may further jointly fine-tune parameters of the whole network after the design phase.

It is easy to check that the network in Eq. (6) actually recursively performs coordinate descent steps (i.e., ) to propagate $v$. So from the optimization viewpoint, we interpret the basic network unit as a descent-direction-estimation architecture for the optimization of the subproblem in Eq. (4). In more challenging scenarios (e.g., when it is hard to define an explicit and solvable $r$ for this subproblem), we can still learn built-in propagation architectures from training data to obtain our desired solution.

Proximal Error Correction: We are now ready to give the formal updating scheme of $x^{k+1}$. The built-in architectures in Eq. (6) do not exactly optimize the original energy in Eq. (1), so it is necessary to introduce an additional step to control our propagation at each iteration. Specifically, denote $v^{k+1}$ as the output of the built-in network in Eq. (6) at the $k$-th iteration. Then we adopt a proximal-gradient-like scheme [Wang et al.2017] to formally update $x^{k+1}$:

$$x^{k+1} \in \arg\min_{x} r(x) + \frac{1}{2}\left\|x - \left(v^{k+1} - \nabla f_{\mu^k}^{x^k}(v^{k+1})\right)\right\|^2 := \mathrm{prox}_{r}\left(v^{k+1} - \nabla f_{\mu^k}^{x^k}(v^{k+1})\right), \tag{7}$$

where $\mathrm{prox}_{r}$ is Moreau’s proximal operator of $r$.
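The correction step in Eq. (7) can be sketched in a few lines once the prior has a closed-form proximal operator. The example below assumes $f(u) = \frac{1}{2}\|Au - y\|^2$ and $r = \lambda_r\|\cdot\|_1$ (so the proximal operator is soft-thresholding); these are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np


def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def grad_f_mu(v, x_k, A, y, mu):
    """Gradient of f(u) + (mu / 2) * ||u - x_k||^2 at u = v, with f(u) = 0.5 * ||A u - y||^2."""
    return A.T @ (A @ v - y) + mu * (v - x_k)


def proximal_correction(v_next, x_k, A, y, mu, lam):
    """Eq. (7): x^{k+1} = prox_r(v^{k+1} - grad f_mu^{x^k}(v^{k+1})) for r = lam * ||.||_1."""
    return soft_threshold(v_next - grad_f_mu(v_next, x_k, A, y, mu), lam)


if __name__ == "__main__":
    A = np.eye(2)
    y = np.array([2.0, 0.1])
    x_k = np.zeros(2)
    v_next = np.array([1.0, 0.0])  # stands in for the output of the built-in network
    print(proximal_correction(v_next, x_k, A, y, mu=0.5, lam=0.2))
```

Here `v_next` is a fixed vector standing in for the network output; in the framework it would come from the learned propagation in Eq. (6).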

Overall, our proposed deep model, called Proximal Alternating Direction Network (PADNet), is summarized in Alg. 1. Notice that we actually consider PADNet in two different scenarios, which can be categorized by the properties of the prior regularization $r$ in Eq. (1). That is:

• Simple priors: $\mathrm{prox}_{r}$ can be computed in closed form.

• Complex priors: $\mathrm{prox}_{r}$ is intractable or $r$ is unknown.

We perform error correction (i.e., Step 4 in Alg. 1) in the first case (Explicit PADNet or EPADNet for short) but directly propagate the output of built-in networks (i.e., Step 3 in Alg. 1) in the second case (Implicit PADNet or IPADNet for short). Theoretical results for these two different scenarios will be respectively proved in the next section.

## Learning with Convergence Guarantee

In general, unrolling task-aware optimization schemes may incorporate rich domain-knowledge into the network structure. Unfortunately, the sequence generated by most existing unrolled deep models will no longer have convergence guarantee, even though nice theoretical results have been proved and verified for their original optimization schemes.

Fortunately, in this work we demonstrate that under some mild conditions, the propagation generated by our PADNet is globally converged, even with built-in network architectures designed in heuristic manners. (Notice that “globally converged” in this paper means that the whole sequence generated by our deep model converges; this concept has been widely used in the non-convex optimization community [Attouch et al.2010].)

### Convergence Behavior Analysis of PADNet

To make our paper self-contained, some necessary definitions should be presented before the formal analysis. Indeed, these concepts have been widely known in variational analysis and optimization and one may refer to [Rockafellar and Wets2009, Attouch et al.2010] and references therein for more details.

###### Definition 1.

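We give necessary definitions, including proper and lower semi-continuous, coercive and semi-algebraic.

• A function $r: \mathbb{R}^n \to (-\infty, +\infty]$ is said to be proper and lower semi-continuous if $\mathrm{dom}(r) \neq \emptyset$, where $\mathrm{dom}(r) := \{x \in \mathbb{R}^n : r(x) < +\infty\}$, and $\liminf_{x \to y} r(x) \geq r(y)$ at any point $y \in \mathrm{dom}(r)$.

• A function $F$ is said to be coercive if $F$ is bounded from below and $F(x) \to \infty$ if $\|x\| \to \infty$, where $\|\cdot\|$ is the $\ell_2$ norm.

• A subset $S$ of $\mathbb{R}^n$ is a real semi-algebraic set if there exist a finite number of real polynomial functions $g_{ij}, h_{ij}: \mathbb{R}^n \to \mathbb{R}$ such that

$$S = \bigcup_{j=1}^{p} \bigcap_{i=1}^{q} \left\{ x \in \mathbb{R}^n : g_{ij}(x) = 0 \text{ and } h_{ij}(x) < 0 \right\}. \tag{8}$$

A function $h$ is called semi-algebraic if its graph $\{(x, z) \in \mathbb{R}^{n+1} : h(x) = z\}$ is a semi-algebraic subset of $\mathbb{R}^{n+1}$.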

###### Remark 1.

Indeed, many functions arising in learning and vision areas, including the $\ell_0$ norm, rational $\ell_p$ norms (i.e., $p = p_1/p_2$ with positive integers $p_1$ and $p_2$) used in our experimental part and their finite sums or products, are all semi-algebraic.

In the following, we first analyze PADNet for tasks with simple priors. Specifically, given a variable $x$, we estimate the discrepancy between it and the optimal solution of Eq. (2) by the function

$$E^k(x) := g_x + \nabla f_{\mu^k}^{x^k}(x), \quad \text{where } g_x \in \partial r(x). \tag{9}$$

Here $E^k$ is deduced from the first-order optimality condition of Eq. (2) at the $k$-th iteration. Then with the following simple error condition, we prove in Theorem 1 that the propagation of EPADNet indeed globally converges to a critical point of Eq. (1). Please refer to the supplemental material for necessary preliminaries and all proofs of the theoretical results in this paper.

###### Condition 1.

(Error Condition) The error function $E^k$ (in Eq. (9)) at the $k$-th iteration should satisfy $\|E^k(x^{k+1})\| \leq C_E \|x^{k+1} - x^k\|$, where $C_E$ is a universal constant.

###### Theorem 1.

(Critical-Point Convergence of Explicit PADNet) Let $f$ be continuously differentiable, $r$ be proper and lower semi-continuous and $F$ be coercive. Then EPADNet converges to a critical point of Eq. (1) under Condition 1. That is, $\{x^k\}$ generated by EPADNet is a bounded sequence and any of its cluster points $x^*$ is a critical point of Eq. (1) (i.e., satisfying $0 \in \partial F(x^*)$). Furthermore, if $F$ is semi-algebraic, then $\{x^k\}$ globally converges to a critical point of Eq. (1).

###### Remark 2.

With the semi-algebraic property of , we can also obtain convergence rate of EPADNet based on a particular desingularizing function with a constant and parameter (defined in [Chouzenoux, Pesquet, and Repetti2016]). Specifically, the sequence converges after finite iterations if . The linear and sublinear rates can be obtained if choosing and , respectively.

It has been verified that a broad class of functions arising in learning problems (even nonconvex and nonsmooth ones) satisfy the assumptions in Theorem 1. For example, both the $\ell_0$ norm and the rational $\ell_p$ norm with $p > 0$ (i.e., $p = p_1/p_2$, where $p_1$ and $p_2$ are positive integers) are proper, lower semi-continuous and semi-algebraic.

Based on above analysis, we propose a learning framework (summarized in Alg. 2) to adaptively design and train globally converged deep models for different learning tasks.

###### Remark 3.

Theorem 1 together with Alg. 2 actually provides a flexible framework with solid theoretical guarantee for deep model design and we only need to check whether built-in networks satisfy Condition 1 during their design phase. Furthermore, in general, any architectures satisfying this condition (even designed in engineering manner) can be incorporated into our deep models.

In contrast, when handling tasks with complex priors, neither error checking (i.e., Step 6 in Alg. 2) during design and training nor error correction (i.e., Step 4 in Alg. 1) during test will be performed. Therefore, we cannot obtain the same convergence results as that in Theorem 1. Fortunately, by enforcing another easily satisfied condition to built-in architectures, we would still prove a fixed-point convergence guarantee for IPADNet.

###### Condition 2.

(Architecture Condition) For any given input , the architecture should satisfy , where is a universal constant.

Notice that this bound condition is relatively weak and we can check that most commonly used linear and nonlinear operations in existing deep networks satisfy it.

###### Theorem 2.

(Fixed-Point Convergence of Implicit PADNet) Let $f$ be continuously differentiable with bounded gradients. Then IPADNet converges under Condition 2. That is, $\{x^k\}$ generated by IPADNet is a Cauchy sequence, so that it globally converges to a fixed point.

###### Remark 4.

Theorem 2 actually provides a theoretically guaranteed paradigm to fuse both analytical and empirical information to build deep models for challenging learning tasks. That is to say, we can simultaneously design a model-based fidelity function $f$ to reflect our theoretical understanding of the problem and learn complex priors from training data with a model-free network architecture.

To end our analysis, we emphasize that the above convergence results are the best we can ask for unless other stronger assumptions are made on the given learning task.

### Implementable Error Calculation

It can be observed in Eq. (9) that directly calculating $E^k$ using its theoretical definition is challenging due to the subgradient term $g_x$. So we provide a calculable formulation based on the following derivations. Specifically, using Eq. (7), we have

$$x^{k+1} = \mathrm{prox}_{r}\left(v^{k+1} - \nabla f_{\mu^k}^{x^k}(v^{k+1})\right) = \mathrm{prox}_{r}\left(x^{k+1} - \nabla f_{\mu^k}^{x^k}(x^{k+1}) + (\mu^k - 1)(x^{k+1} - v^{k+1}) - \nabla f(v^{k+1}) + \nabla f(x^{k+1})\right). \tag{10}$$

By setting in Eq. (10) and following Theorem 1, we directly have that if , then

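$$e^{k+1} - \nabla f_{\mu^k}^{x^k}(x^{k+1}) \in \partial r(x^{k+1}). \tag{11}$$

Therefore, we actually obtain the following implementable error calculation formulation for $E^k$

$$E^k(x^{k+1}) := e^{k+1} = (\mu^k - 1)(x^{k+1} - v^{k+1}) - \nabla f(v^{k+1}) + \nabla f(x^{k+1}). \tag{12}$$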
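The quantity in Eq. (12) only needs gradients of $f$, so it is cheap to evaluate. The sketch below computes it for an assumed quadratic fidelity $f(x) = \frac{1}{2}\|Ax - y\|^2$ and numerically checks the inclusion in Eq. (11) for an assumed $\ell_1$ prior. The error test uses the form $\|e^{k+1}\| \leq C_E\|x^{k+1} - x^k\|$, which is inferred here from the proof of Theorem 1 and should be read as an assumption; all function names are hypothetical.

```python
import numpy as np


def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)


def grad_f(x, A, y):
    """Gradient of the assumed fidelity f(x) = 0.5 * ||A x - y||^2."""
    return A.T @ (A @ x - y)


def implementable_error(x_next, v_next, A, y, mu):
    """e^{k+1} from Eq. (12)."""
    return (mu - 1.0) * (x_next - v_next) - grad_f(v_next, A, y) + grad_f(x_next, A, y)


def satisfies_error_condition(e, x_next, x_k, c_e):
    """Assumed form of Condition 1: ||e^{k+1}|| <= C_E * ||x^{k+1} - x^k||."""
    return bool(np.linalg.norm(e) <= c_e * np.linalg.norm(x_next - x_k))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((8, 5)) / np.sqrt(8)
    y = rng.standard_normal(8)
    x_k, v_next = rng.standard_normal(5), rng.standard_normal(5)
    mu, lam = 2.0, 0.3
    # Eq. (7) with r = lam * ||.||_1.
    x_next = soft_threshold(v_next - grad_f(v_next, A, y) - mu * (v_next - x_k), lam)
    e = implementable_error(x_next, v_next, A, y, mu)
    # Left-hand side of Eq. (11); it should be a subgradient of lam * ||.||_1 at x_next.
    s = e - grad_f(x_next, A, y) - mu * (x_next - x_k)
    print(s, lam * np.sign(x_next))
    print(satisfies_error_condition(e, x_next, x_k, c_e=10.0))
```

For the $\ell_1$ prior, the subgradient check reduces to `s` equalling `lam * sign(x_next)` on nonzero entries and lying in `[-lam, lam]` on zero entries.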

### Discussions

Intuitively, one may argue that building a deeper network should definitely result in good performance. Unfortunately, empirical evidence [Simonyan and Zisserman2014] suggests that improvement cannot be trivially gained by simply adding more layers; worse, deeper networks even suffer from a decline in performance in some applications [Shen, Lin, and Huang2016]. Therefore, it is particularly worthwhile to investigate the intrinsic propagation behaviors of networks with different topological structures and architectures from a more solid theoretical perspective.

Indeed, our theories above build intrinsic theoretical connections between unrolled deep models and the original numerical schemes. We also investigate conditions for incorporating heuristic architectures into the proposed deep model. Therefore, the studies in this paper should provide a new perspective and introduce several powerful tools from the optimization area to address the challenging but fundamental issues discussed in the above paragraph.

## Experiments

To verify our theoretical results and demonstrate the effectiveness of our deep models in application fields, we apply PADNet to two real-world applications, i.e., non-blind deconvolution and single image haze removal. All experiments are conducted on a PC with an Intel Core i7 CPU at 3.4 GHz, 32 GB RAM and an NVIDIA GeForce GTX 1050 Ti GPU.

### Non-blind Deconvolution

We first consider non-blind deconvolution, which is an important task in learning and vision areas. Specifically, given an observation $y$ (e.g., an image), the latent signal $x$ can be processed in a filtered domain as follows [Krishnan and Fergus2009, Schmidt and Roth2014] , where $\mathcal{D}$ is a set of filters (e.g., horizontal and vertical gradient operations), $\otimes$ denotes convolution, $k$ is a point spread function and denotes errors/noises. This problem can be formulated as the maximum-a-posteriori estimation

$$x^* = \arg\max_{x} p(x|y) = \arg\max_{x} \log p(y|x) + \log p(x). \tag{13}$$

Here we follow typical choices to consider $\ell_2$-fidelity (i.e., ) and $\ell_p$-regularization (i.e., , ), where is the parameter. We adopt results in [Zuo et al.2013] to calculate the proximal operation of general $\ell_p$-minimization. In the following deconvolution experiments, we always use the set of images of size built in [Chen and Pock2017] as our training data. Two commonly used image deblurring benchmarks, collected respectively by Levin et al. [Levin et al.2009] (32 blurry images of size ) and Sun et al. [Sun et al.2013] (640 blurry images with Gaussian noises, sizes range from to ), are used for testing.

#### Convergence Behaviors on Gradient Domain

The gradient of images plays a very important role in image structure analysis. Here we first consider deconvolution on the gradient domain to verify the convergence behaviors of our designed deep models to a given energy with a simple prior. Specifically, the energy in the gradient domain is defined as

$$\min_{g} \frac{1}{2}\|g \otimes k - \mathcal{D}(y)\|^2 + \lambda\|g\|_p^p, \tag{14}$$

where $g$ denotes the gradient of the latent image. We first build the basic architecture as a cascade of two convolutions with one RBF nonlinearity [Schmidt and Roth2014] between them. Then we perform Alg. 2 based on Eq. (14) with to respectively design three EPADNet models. We also establish an IPADNet model from Alg. 2 with only the fidelity . To compare iteration behaviors with conventional optimization strategies, we also run the popular ADMM and Half-Quadratic Splitting (HQS) [Zuo et al.2013] algorithms on Eq. (14) with the same $\ell_p$-regularizer and parameters.

The averaged convergence results of the compared algorithms on Levin et al.’s benchmark are reported in Tab. 1. As IPADNet does not depend on functions, we just repeat its results for the three cases (i.e., ) in this table.

It can be seen that our designed deep models (i.e., one IPADNet and three EPADNets) need far fewer iterations but obtain more accurate estimations than conventional optimization schemes. Moreover, the performance of IPADNet is better than EPADNets regularized by and , but a little worse than the energy. These results make sense because the prior learned from training data should perform better than the relatively improper handcrafted priors (e.g., and norms in this task). If the prior function can fit the data distribution well (e.g., norm here), the critical-point convergence guarantee of EPADNet results in better performance than the relatively weak fixed-point convergence of IPADNet.

We also plot curves of relative errors (i.e., iteration error and reconstruction error ) and error condition (referring to and ) on an example image from this benchmark in Fig. 1, where denotes the ground-truth image gradient. To provide more readable illustrations of convergence behaviors, here all relative errors are plotted starting from . We also show zoomed in curve comparisons of our two deep models in Fig. 1 (b). Notice that we indeed only have one implicit deep model for this task. But to compare its performance with methods based on different energies, we just repeatedly plot its relative errors (as green curves) in multiple subfigures.

It is observed in Fig. 1 (a) that our deep models always converged within - iterations, while both ADMM and HQS needed dozens of steps to stop their iterations. The dashdot curves in Fig. 1 (a) show that the designed EPADNet satisfied the constraint of errors in Condition 1 all the time, thus the global convergence to the critical-point of Eq. (14) can be experimentally guaranteed. All these results verified our proved theories. We can further see in Fig. 1 (b) that propagations of our two deep models (solid red and green curves) had obtained significantly lower reconstruction errors than conventional algorithms even just after the first iteration (i.e., the initial points of these curves). This is because built-in networks actually learned a direct descent direction toward the desired solutions, which demonstrated the superiority of our framework again.

#### Explicit / Implicit PADNet on Image Domain

Non-blind deconvolution on image domain is commonly formulated as the following energy minimization task [Li et al.2013, Krishnan and Fergus2009, Schmidt and Roth2014]

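$$\min_{x, g} \frac{1}{2} f_{ex}(x, g; y) + \lambda\|g\|_p^p, \tag{15}$$

in which the fidelity can be formulated as

$$f_{ex}(x, g; y) := \inf_{x} \left\{ \|x \otimes k - y\|^2 + \beta\|\mathcal{D}(x) - g\|^2 \right\}. \tag{16}$$

Here $\beta$ is a penalty parameter, and $x$ and $g$ are variables in the image and gradient domains, respectively.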

In this part, we build an explicit PADNet using Eq. (15) to pursue , in which we set and introduce an additional linear layer derived by to transfer variables from the gradient domain to the image domain. In contrast, by simply defining and discarding explicit $\ell_p$-priors, we can also design an implicit PADNet to learn priors from training data for this task. Here the basic architecture (used in our deep models) consists of convolution layers. ReLU nonlinearities are added between each two linear layers accordingly, and batch normalizations (BN) [Ioffe and Szegedy2015] are also introduced for convolution operations from the -nd to the -th linear layers.

We compare the performance of our two deep models against state-of-the-art algorithms, including TV [Li et al.2013], HL [Krishnan and Fergus2009], EPLL [Zoran and Weiss2011], IDD-BM3D [Danielyan, Katkovnik, and Egiazarian2012], MLP [Schuler et al.2013] and CSF [Schmidt and Roth2014], on both the standard Levin et al.’s and the more challenging Sun et al.’s benchmarks. The averaged quantitative results (i.e., PSNR and SSIM) are reported in Tab. 2, in which “(E)” and “(I)” denote algorithms based on explicit and implicit PADNet, respectively. Note that “Ours (E)” and the works in [Li et al.2013, Krishnan and Fergus2009] all address this task by optimizing Eq. (15). Thanks to built-in networks, we achieve much better performance than conventional optimization approaches. We further observe that discriminative learning approaches [Schmidt and Roth2014, Schuler et al.2013] also perform well, as they learn adaptive networks from training data. Overall, the results of our two algorithms are better than those of the other compared approaches. The PSNR score of “Ours (E)” is even higher than that of “Ours (I)” on the standard Levin et al.’s dataset. We argue that this is reasonable because the explicit prior is already powerful enough for relatively simple test images. In contrast, “Ours (I)” obtains the best quantitative results on Sun et al.’s dataset, which demonstrates that our prior-and-data aggregated framework is especially effective on challenging real-world applications (see Remark 4).

### Single Image Haze Removal

Finally, we evaluate PADNet on the task of single image haze removal, which is a challenging real-world vision application. Most existing works address this task as estimating the latent scene radiance $J$ from a given hazy observation $I$ using the following linear interpolation formula

$$I(x) = t(x) J(x) + (1 - t(x)) A, \tag{17}$$

where $t$ is the transmission, $A$ is the global atmospheric light and $x$ denotes the pixel index.
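Once a transmission map has been estimated, Eq. (17) can be inverted pixel-wise to recover the scene radiance. The following minimal sketch shows that inversion. The array shapes, the function names, and the lower bound `t_min` on the transmission (a common safeguard against division by very small values) are assumptions of this example, not details taken from the paper.

```python
import numpy as np


def synthesize_haze(J, t, A):
    """Forward model of Eq. (17): I = t * J + (1 - t) * A.

    J: (H, W, 3) radiance, t: (H, W) transmission, A: (3,) atmospheric light.
    """
    t3 = t[..., None]
    return t3 * J + (1.0 - t3) * A


def recover_radiance(I, t, A, t_min=0.1):
    """Invert Eq. (17) for J, clipping t from below to keep the division stable."""
    t3 = np.maximum(t, t_min)[..., None]
    return (I - A) / t3 + A


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    J = rng.uniform(0.0, 1.0, size=(4, 5, 3))
    t = rng.uniform(0.2, 1.0, size=(4, 5))
    A = np.array([0.9, 0.85, 0.8])
    I = synthesize_haze(J, t, A)
    print(np.abs(recover_radiance(I, t, A) - J).max())
```

With an exact transmission above `t_min`, the round trip reproduces the radiance up to floating-point error; in practice the quality of the result depends on the estimated transmission, which is what the energy below is designed to improve.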

It is known that transmission expresses the relative portion of light that manages to survive the entire path between the observer and a surface point in the scene without being scattered [Fattal2014]. From Eq. (17), estimating an accurate transmission map plays the core role in this task. However, because multiple solutions exist for a single hazy image, the problem is highly ill-posed. Recent works often design their models based on different perspectives on transmissions within the following prior-regularized energy

$$\min_{t} \frac{1}{2}\|t - \tilde{t}\|^2 + \lambda r(t), \quad \text{e.g.,} \quad \begin{cases} r_{TGV}(t) = \|\nabla t - z\|_1 + \beta\|\nabla z\|_1, & \text{[Chen, Do, and Wang2016]}, \\ r_{MRF}(t) = \sum_{j} \|w_j \odot (d_j \otimes t)\|_1, & \text{[Meng et al.2013]}, \end{cases} \tag{18}$$

where $t$ and $\tilde{t}$ respectively denote the discrete transmission vector and its propagation guidance. The regularization $r(t)$ can be derived based on different tools, e.g., Total Generalized Variation (TGV) [Chen, Do, and Wang2016] and Markov Random Field (MRF) [Meng et al.2013]. In Eq. (18), $z$ denotes an auxiliary variable, $\odot$ is the Hadamard product and $w_j$ are weight vectors for local filters $d_j$.

In this part, we first utilize implicit strategy to design PADNet based on fidelity (with the same guidance defined in [Meng et al.2013]) to estimate transmission and then recover the latent scene radiance from Eq. (17) as that in [He, Sun, and Tang2011, Chen, Do, and Wang2016, Meng et al.2013]. We build basic architecture with convolution layers (ReLU and BN operations are incorporated using the same strategy as that in above image deconvolution task) and train it on synthetic hazy images [Ren et al.2016] for our deep model.

We evaluate the performance of our deep model together with five existing handcrafted-prior based algorithms (i.e., [He, Sun, and Tang2011, Meng et al.2013, Chen, Do, and Wang2016, Berman, Avidan, and others2016, Li et al.2014]) and two empirically designed deep networks (i.e., [Ren et al.2016, Cai et al.2016]) on the commonly used Fattal’s benchmark [Fattal2008], which consists of challenging hazy images, including architecture, natural scenery and indoor scenes. (In this subsection, we always denote these methods as He et al., Meng et al., Chen et al., Berman et al., Li et al., Ren et al. and Cai et al., respectively.) The averaged quantitative results, including PSNR, SSIM and running time in seconds (denoted as “Time (s)”), are given in Tab. 3. The two empirically designed networks in [Cai et al.2016, Ren et al.2016] perform better than most conventional prior-based methods. Although it obtains good dehazing results, the work in [Chen, Do, and Wang2016] has the longest running time. Our proposed deep model achieves the best performance among all compared algorithms on this benchmark. This is mainly because PADNet can successfully fuse cues from both human perspectives and training data to estimate haze distributions. Furthermore, PADNet is the fastest among all compared methods, which also verifies the efficiency of our framework.

## Conclusions

This paper proposed a novel framework, named proximal alternating direction network (PADNet), to design deep models for different learning tasks. Our theoretical results first showed that we can utilize empirically designed architectures to build globally converged PADNet for the given energy minimization model. We further proved that a converged PADNet can also be designed by learning priors from training data. At last we experimentally verified our analysis and demonstrated promising results of PADNet on different real-world applications.

## Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Nos. 61672125, 61632019, 61432003, 61572096 and 61733002), and the Hong Kong Scholar Program (No. XJ2015008). Dr. Liu is also a visiting researcher with Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen 518060

## Supplemental Material

### Preliminaries

The following definitions and lemmas are well known in variational analysis and optimization. More details can be found in [Rockafellar and Wets2009, Attouch et al.2010] and the references therein.

###### Definition 2.

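Here we give the necessary definitions of proper and lower semi-continuous, coercive, and semi-algebraic functions.

• A function $\psi: \mathbb{R}^n \to (-\infty, +\infty]$ is said to be proper and lower semi-continuous if $\mathrm{dom}(\psi) \neq \emptyset$, where $\mathrm{dom}(\psi) := \{x \in \mathbb{R}^n : \psi(x) < +\infty\}$, and $\liminf_{x \to y} \psi(x) \geq \psi(y)$ at any point $y \in \mathrm{dom}(\psi)$.

• A function $\psi$ is said to be coercive if $\psi$ is bounded from below and $\psi(x) \to \infty$ if $\|x\| \to \infty$, where $\|\cdot\|$ is the $\ell_2$ norm.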

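• A subset $S$ of $\mathbb{R}^n$ is a real semi-algebraic set if there exist a finite number of real polynomial functions $g_{ij}, h_{ij}: \mathbb{R}^n \to \mathbb{R}$ such that

$$S = \bigcup_{j=1}^{p} \bigcap_{i=1}^{q} \{x \in \mathbb{R}^n : g_{ij}(x) = 0 \text{ and } h_{ij}(x) < 0\}. \tag{19}$$

A function $\psi: \mathbb{R}^n \to \mathbb{R}$ is called semi-algebraic if its graph $\{(x, z) \in \mathbb{R}^{n+1} : \psi(x) = z\}$ is a semi-algebraic subset of $\mathbb{R}^{n+1}$.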

Indeed, many functions arising in learning areas, including the $\ell_0$ norm, rational $\ell_p$ norms (i.e., $p = p_1/p_2$ with positive integers $p_1$ and $p_2$) and their finite sums or products, are all semi-algebraic.
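As a small worked illustration (our own example, not taken from the cited references), the absolute value function on $\mathbb{R}$ can be checked against the definition directly. Its graph is a union of three sets of the form required in (19), each described by one polynomial equality and one strict polynomial inequality:

$$
\{(x, y) \in \mathbb{R}^2 : y = |x|\} = \{y - x = 0,\ -x < 0\} \cup \{y + x = 0,\ x < 0\} \cup \{x^2 + y^2 = 0,\ -1 < 0\}.
$$

The third set is just the origin; the trivial inequality $-1 < 0$ is included only to match the form of (19). Since finite sums of semi-algebraic functions remain semi-algebraic, the same reasoning extends to the $\ell_1$ norm on $\mathbb{R}^n$.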

###### Lemma 1.

Here we list some necessary properties that will be used in the following proofs.

• In the nonsmooth context, Fermat’s rule remains unchanged. That is, if $x$ is a local minimizer of $\psi$, then $0 \in \partial\psi(x)$.

• If is a continuously differentiable function, then .

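• Let $\psi$ be proper and lower semi-continuous and $\{(x^k, g^k)\}$ be a sequence in the graph of $\partial\psi$. If $(x^k, g^k) \to (x, g)$ and $\psi(x^k) \to \psi(x)$ as $k \to \infty$, then $(x, g)$ is also in the graph of $\partial\psi$.

Finally, we recall that a critical point of a function $\psi$ is a point $x$ in the domain of $\psi$ whose subdifferential $\partial\psi(x)$ contains $0$.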
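To make the notion of a critical point concrete, the following minimal sketch tests the inclusion $0 \in \partial F(x)$ for a toy energy $F(x) = \frac{1}{2}\|x - b\|^2 + \lambda\|x\|_1$. The energy, the helper names `is_critical_point` and `soft_threshold`, and the tolerance are illustrative assumptions and are not part of the paper.

```python
import numpy as np


def soft_threshold(z, t):
    """Proximal operator of t*||.||_1 (closed-form minimizer of the toy energy)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def is_critical_point(x, b, lam, tol=1e-9):
    """Check 0 in dF(x) for the toy energy F(x) = 0.5*||x - b||^2 + lam*||x||_1."""
    x = np.asarray(x, dtype=float)
    b = np.asarray(b, dtype=float)
    grad = x - b                       # gradient of the smooth part
    nonzero = np.abs(x) > tol
    # Where x_i != 0, the subdifferential of lam*|x_i| is the single value lam*sign(x_i).
    ok_nonzero = np.abs(grad[nonzero] + lam * np.sign(x[nonzero])) <= tol
    # Where x_i == 0, it is the interval [-lam, lam], so we need |grad_i| <= lam.
    ok_zero = np.abs(grad[~nonzero]) <= lam + tol
    return bool(ok_nonzero.all() and ok_zero.all())


b = np.array([3.0, -0.5, 0.2, -2.0])
lam = 1.0
x_star = soft_threshold(b, lam)        # minimizer of the toy energy

print(is_critical_point(x_star, b, lam))   # the minimizer satisfies Fermat's rule
print(is_critical_point(b, b, lam))        # b itself ignores the l1 term
```

The check treats the zero and nonzero coordinates separately because the subdifferential of the $\ell_1$ term is set-valued only at zero.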

### Proof of Theorem 1

###### Proof.

The error condition of in (9) can be equivalently reformulated as

$$x^{k+1} \in \arg\min_{x} f^{\mu^k}_{x^k}(x) + r(x) - \langle x, e^{k+1} \rangle, \tag{20}$$

where $f^{\mu^k}_{x^k}(x) = f(x) + \frac{\mu^k}{2}\|x - x^k\|^2$. Therefore, the propagation behavior of Alg. 2 should be understood as exactly solving Eq. (20) to output $x^{k+1}$ at the $k$-th iteration. Thus

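$$
\begin{aligned}
& F(x^k) - \langle x^k, e^{k+1} \rangle \geq F(x^{k+1}) - \langle x^{k+1}, e^{k+1} \rangle + \frac{\mu^k}{2}\|x^{k+1} - x^k\|^2 \\
\Rightarrow\ & F(x^k) - F(x^{k+1}) \geq \frac{\mu^k}{2}\|x^{k+1} - x^k\|^2 - \langle x^{k+1} - x^k, e^{k+1} \rangle \\
& \geq \frac{\mu^k}{2}\|x^{k+1} - x^k\|^2 - \frac{1}{2\eta}\|x^{k+1} - x^k\|^2 - \frac{\eta}{2}\|e^{k+1}\|^2 \\
& \geq \left(\frac{\mu^k}{2} - \frac{\eta C_E^2}{2} - \frac{1}{2\eta}\right)\|x^{k+1} - x^k\|^2 \\
& = \left(\frac{\mu^k}{4} - \frac{C_E^2}{\mu^k}\right)\|x^{k+1} - x^k\|^2,
\end{aligned}
\tag{21}
$$

in which $\eta$ can be any positive constant, but the last equality only holds with $\eta = 2/\mu^k$. So we have $F(x^{k+1}) \leq F(x^k)$. As $F$ is coercive, we also have that $\{x^k\}$ is bounded. Furthermore, as $\{F(x^k)\}$ is nonincreasing, it converges to a constant (denoted as $F^*$). So summing (21) from $k = 0$ to $\infty$ leads to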

$$\min_k \left\{\frac{\mu^k}{4} - \frac{C_E^2}{\mu^k}\right\} \sum_{k=0}^{\infty} \|x^{k+1} - x^k\|^2 \leq F(x^0) - F^* < \infty. \tag{22}$$

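Using $\mu^k > 2C_E$, which keeps the coefficient in (22) positive, we have $\|x^{k+1} - x^k\| \to 0$ if $k \to \infty$.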
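The descent estimate (21) can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it uses a toy energy $F(x) = \frac{1}{2}\|x - b\|^2 + \lambda\|x\|_1$ (not one of the paper's models), a hypothetical helper `inexact_prox_step` that solves the perturbed subproblem (20) in closed form, and randomly drawn errors that are shrunk until they satisfy $\|e^{k+1}\| \leq C_E\|x^{k+1} - x^k\|$. It is not an implementation of Alg. 2.

```python
import numpy as np


def soft_threshold(z, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def energy(x, b, lam):
    """Toy energy F(x) = 0.5*||x - b||^2 + lam*||x||_1 (an assumption for this sketch)."""
    return 0.5 * np.sum((x - b) ** 2) + lam * np.sum(np.abs(x))


def inexact_prox_step(x, b, lam, mu, c_e, rng):
    """Solve the perturbed subproblem exactly for an error e with ||e|| <= c_e*||z - x||."""
    e = rng.normal(size=x.shape)
    for _ in range(60):
        # Closed-form minimizer of f(z) + mu/2*||z - x||^2 + r(z) - <z, e>.
        z = soft_threshold((b + mu * x + e) / (1.0 + mu), lam / (1.0 + mu))
        if np.linalg.norm(e) <= c_e * np.linalg.norm(z - x):
            return z
        e = 0.5 * e  # shrink the error until the condition holds
    # Fall back to the exact proximal step (e = 0 always satisfies the condition).
    return soft_threshold((b + mu * x) / (1.0 + mu), lam / (1.0 + mu))


rng = np.random.default_rng(0)
b = rng.normal(size=20)
lam, mu, c_e = 0.3, 1.0, 0.4           # mu > 2*c_e, so mu/4 - c_e**2/mu > 0
coef = mu / 4.0 - c_e ** 2 / mu

x = 5.0 * rng.normal(size=20)
slack = []
for _ in range(30):
    x_new = inexact_prox_step(x, b, lam, mu, c_e, rng)
    decrease = energy(x, b, lam) - energy(x_new, b, lam)
    # Slack of (21): energy decrease minus the guaranteed lower bound.
    slack.append(decrease - coef * np.sum((x_new - x) ** 2))
    x = x_new

print("smallest slack in (21):", min(slack))
```

Up to rounding, every entry of `slack` should be nonnegative, which is what (21) asserts for any admissible error sequence.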

Following (9), we have

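$$e^{k+1} - \mu^k(x^{k+1} - x^k) \in \partial F(x^{k+1}). \tag{23}$$

Then it is straightforward that

$$\|\partial F(x^{k+1})\| = \|e^{k+1} - \mu^k(x^{k+1} - x^k)\| \leq (\mu^k + C_E)\|x^{k+1} - x^k\| \to 0, \quad \text{if } k \to \infty. \tag{24}$$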

Let $x^*$ be any cluster point of $\{x^k\}$, i.e., $x^{k_j} \to x^*$ if $j \to \infty$. Since $r$ is lower semi-continuous, we have

$$\liminf_{j \to \infty} r(x^{k_j}) \geq r(x^*). \tag{25}$$

From (20), we have

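$$f^{\mu^k}_{x^k}(x^{k+1}) + r(x^{k+1}) - \langle x^{k+1}, e^{k+1} \rangle \leq f^{\mu^k}_{x^k}(x) + r(x) - \langle x, e^{k+1} \rangle, \quad \forall x. \tag{26}$$

Setting $x = x^*$ and $k = k_j - 1$ in (26) and letting $j \to \infty$, we obtain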

$$\limsup_{j \to \infty} r(x^{k_j}) \leq r(x^*), \tag{27}$$

in which we used the facts that $f$ is continuous and $\{\mu^k\}$ is bounded. So we have that $r(x^{k_j})$ tends to $r(x^*)$ as $j \to \infty$. This together with the continuity of $f$ directly results in

$$\lim_{j \to \infty} F(x^{k_j}) = F(x^*). \tag{28}$$

Using (24), (28) and Lemma 1, we have that $0 \in \partial F(x^*)$. So any cluster point of $\{x^k\}$ is a critical point of Eq. (1).

If we further assume that $F$ is a semi-algebraic function, then it satisfies the well-known Kurdyka-Łojasiewicz property [Rockafellar and Wets2009, Attouch et al.2010]. So we have that $\sum_{k=0}^{\infty} \|x^{k+1} - x^k\| < \infty$ following [Attouch et al.2010]. This implies that $\{x^k\}$ is a Cauchy sequence and hence a convergent sequence. Since we have proved above that $x^{k_j} \to x^*$ if $j \to \infty$, we finally have $x^k \to x^*$ if $k \to \infty$, which completes the proof. ∎

### Proof of Theorem 2

###### Proof.

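The updating scheme of $u$ in Alg. 2 implies

$$u - (v^k - \lambda^k) = -\frac{1}{\rho^k} \nabla f^{\mu^k}_{x^k}(u). \tag{29}$$

Moreover, it is easy to check that $\nabla f^{\mu^k}_{x^k}(u)$ is bounded, i.e., there exists $M > 0$ such that $\|\nabla f^{\mu^k}_{x^k}(u)\| \leq M$. So at the $k$-th iteration, we have

$$\|u^{k+1} - (v^k - \lambda^k)\| = \frac{1}{\rho^k}\|\nabla f^{\mu^k}_{x^k}(u)\| \leq \frac{M}{\rho^k}. \tag{30}$$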

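Next, define $\tilde{v}^k = u^{k+1} + \lambda^k$. Then following Steps 5 and 8 in Alg. 2 and the bounded constraint on $\mathcal{N}_{\alpha^k}$, we have

$$\|v^{k+1} - (u^{k+1} + \lambda^k)\| = \|\mathcal{N}_{\alpha^k}(\tilde{v}^k) - \tilde{v}^k\| \leq C_N \alpha^k = C_N \sqrt{\frac{1}{\rho^k}}. \tag{31}$$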

Using Eq. (30) and (31), we can show

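$$
\begin{aligned}
\|v^{k+1} - v^k\| &\leq \|v^{k+1} - \tilde{v}^k\| + \|\tilde{v}^k - v^k\| \leq C_N\sqrt{\frac{1}{\rho^k}} + \frac{M}{\rho^k} \\
&= \sqrt{\frac{1}{\rho^k}}\left(C_N + M\sqrt{\frac{1}{\rho^k}}\right) \leq \left(C_N + M\sqrt{\frac{1}{\rho^0}}\right)\sqrt{\frac{1}{\rho^k}}.
\end{aligned}
\tag{32}
$$

As for $\lambda$, we have

$$
\begin{aligned}
\|\lambda^{k+1}\| &= \|\lambda^k + (u^{k+1} - v^{k+1})\| = \|\lambda^k + u^{k+1} - \tilde{v}^k - (\mathcal{N}_{\alpha^k}(\tilde{v}^k) - \tilde{v}^k)\| \\
&= \|\mathcal{N}_{\alpha^k}(\tilde{v}^k) - \tilde{v}^k\| \leq C_N\sqrt{\frac{1}{\rho^k}} \\
\Rightarrow\ \|\lambda^{k+1} - \lambda^k\| &\leq \|\lambda^{k+1}\| + \|\lambda^k\| \leq 2C_N\sqrt{\frac{1}{\rho^k}}.
\end{aligned}
\tag{33}
$$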

Following Step 11 in Alg. 2, we have

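$$
\begin{aligned}
\|u^{k+1} - u^k\| &= \|(\lambda^{k+1} - \lambda^k + v^{k+1}) - (\lambda^k - \lambda^{k-1} + v^k)\| \\
&\leq \|\lambda^{k+1} - \lambda^k\| + \|\lambda^k - \lambda^{k-1}\| + \|v^{k+1} - v^k\| \\
&\leq 2C_N\sqrt{\frac{1}{\rho^k}} + 2C_N\sqrt{\frac{1}{\rho^{k-1}}} + \sqrt{\frac{1}{\rho^k}}\left(C_N + M\sqrt{\frac{1}{\rho^0}}\right) \\
&\leq \left((3 + 2\gamma)C_N + M\sqrt{\frac{1}{\rho^0}}\right)\sqrt{\frac{1}{\rho^k}}.
\end{aligned}
\tag{34}
$$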

So $\{u^k\}$ is a Cauchy sequence and hence there exists a fixed point $u^*$ such that $u^k \to u^*$ if $k \to \infty$, which completes the proof. ∎
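The final step relies on the per-iteration bounds being summable, since $\|u^m - u^n\| \leq \sum_{k=n}^{m-1} \|u^{k+1} - u^k\|$ for $m > n$. The sketch below illustrates this under a hypothetical geometric penalty schedule $\rho^k = \rho^0 g^k$ with $g > 1$; the schedule, the constants and the helper names `step_bound` and `tail_bound` are assumptions for illustration, as this excerpt does not restate the actual update of $\rho^k$ in Alg. 2.

```python
import math


def step_bound(k, c, rho0, growth):
    """Bound c*sqrt(1/rho_k) under the assumed schedule rho_k = rho0 * growth**k."""
    return c / math.sqrt(rho0 * growth ** k)


def tail_bound(n, c, rho0, growth):
    """Closed form of sum_{k >= n} step_bound(k): a geometric series with ratio q < 1."""
    q = 1.0 / math.sqrt(growth)
    return c / math.sqrt(rho0) * q ** n / (1.0 - q)


c, rho0, growth = 2.0, 1.0, 1.5        # illustrative constants only
# A long partial sum of the step bounds agrees with the closed-form tail.
partial = sum(step_bound(k, c, rho0, growth) for k in range(5, 400))
# The tail bounds ||u^m - u^n|| for every m > n and shrinks as n grows.
tails = [tail_bound(n, c, rho0, growth) for n in (5, 20, 80)]

print(partial, tails)
```

Because the tail tends to zero as the starting index grows, the distance between any two late iterates becomes arbitrarily small, which is exactly the Cauchy property used above.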

## References

• [Andrychowicz et al.2016] Andrychowicz, M.; Denil, M.; Gomez, S.; Hoffman, M. W.; Pfau, D.; Schaul, T.; and de Freitas, N. 2016. Learning to learn by gradient descent by gradient descent. In NIPS, 3981–3989.
• [Attouch et al.2010] Attouch, H.; Bolte, J.; Redont, P.; and Soubeyran, A. 2010. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research 35(2):438–457.
• [Berman, Avidan, and others2016] Berman, D.; Avidan, S.; et al. 2016. Non-local image dehazing. In CVPR, 1674–1682.
• [Cai et al.2016] Cai, B.; Xu, X.; Jia, K.; Qing, C.; and Tao, D. 2016. DehazeNet: An end-to-end system for single image haze removal. IEEE TIP 25(11):5187–5198.
• [Chen and Pock2017] Chen, Y., and Pock, T. 2017. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE TPAMI 39(6):1256–1272.
• [Chen, Do, and Wang2016] Chen, C.; Do, M. N.; and Wang, J. 2016. Robust image and video dehazing with visual artifact suppression via gradient residual minimization. In ECCV, 576–591.
• [Chouzenoux, Pesquet, and Repetti2016] Chouzenoux, E.; Pesquet, J.-C.; and Repetti, A. 2016. A block coordinate variable metric forward–backward algorithm. Journal of Global Optimization 66(3):457–485.
• [Danielyan, Katkovnik, and Egiazarian2012] Danielyan, A.; Katkovnik, V.; and Egiazarian, K. 2012. BM3D frames and variational image deblurring. IEEE TIP 21(4):1715–1728.
• [Fattal2008] Fattal, R. 2008. Single image dehazing. ACM Transactions on Graphics (TOG) 27(3):72.
• [Fattal2014] Fattal, R. 2014. Dehazing using color-lines. ACM Transactions on Graphics (TOG) 34(1):13.
• [Gregor and LeCun2010] Gregor, K., and LeCun, Y. 2010. Learning fast approximations of sparse coding. In ICML, 399–406.
• [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
• [He, Sun, and Tang2011] He, K.; Sun, J.; and Tang, X. 2011. Single image haze removal using dark channel prior. IEEE TPAMI 33(12):2341–2353.
• [Ioffe and Szegedy2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 448–456.
• [Krishnan and Fergus2009] Krishnan, D., and Fergus, R. 2009. Fast image deconvolution using hyper-Laplacian priors. In NIPS, 1033–1041.
• [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 1097–1105.
• [Levin et al.2009] Levin, A.; Weiss, Y.; Durand, F.; and Freeman, W. T. 2009. Understanding and evaluating blind deconvolution algorithms. In CVPR, 1964–1971.
• [Li et al.2013] Li, C.; Yin, W.; Jiang, H.; and Zhang, Y. 2013. An efficient augmented Lagrangian method with applications to total variation minimization. Computational Optimization and Applications 56(3):507–530.
• [Li et al.2014] Li, Y.; Guo, F.; Tan, R. T.; and Brown, M. S. 2014. A contrast enhancement framework with JPEG artifacts suppression. In ECCV, 174–188.
• [Lin, Liu, and Su2011] Lin, Z.; Liu, R.; and Su, Z. 2011. Linearized alternating direction method with adaptive penalty for low-rank representation. In NIPS, 612–620.
• [Meng et al.2013] Meng, G.; Wang, Y.; Duan, J.; Xiang, S.; and Pan, C. 2013. Efficient image dehazing with boundary constraint and contextual regularization. In ICCV, 617–624.
• [Parikh, Boyd, and others2014] Parikh, N.; Boyd, S.; et al. 2014. Proximal algorithms. Foundations and Trends® in Optimization 1(3):127–239.
• [Ren et al.2016] Ren, W.; Liu, S.; Zhang, H.; Pan, J.; Cao, X.; and Yang, M.-H. 2016. Single image dehazing via multi-scale convolutional neural networks. In ECCV, 154–169.
• [Rockafellar and Wets2009] Rockafellar, R. T., and Wets, R. J.-B. 2009. Variational analysis, volume 317. Springer Science & Business Media.
• [Schmidt and Roth2014] Schmidt, U., and Roth, S. 2014. Shrinkage fields for effective image restoration. In CVPR, 2774–2781.
• [Schuler et al.2013] Schuler, C. J.; Christopher Burger, H.; Harmeling, S.; and Schölkopf, B. 2013. A machine learning approach for non-blind image deconvolution. In CVPR, 1067–1074.
• [Shen, Lin, and Huang2016] Shen, L.; Lin, Z.; and Huang, Q. 2016. Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, 467–482.
• [Simonyan and Zisserman2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
• [Sun et al.2013] Sun, L.; Cho, S.; Wang, J.; and Hays, J. 2013. Edge-based blur kernel estimation using patch priors. In ICCP.
• [Teh et al.2003] Teh, Y. W.; Welling, M.; Osindero, S.; and Hinton, G. E. 2003. Energy-based models for sparse overcomplete representations. JMLR 4(Dec):1235–1260.
• [Wang et al.2017] Wang, Y.; Liu, R.; Song, X.; and Su, Z. 2017. An inexact proximal alternating direction method for non-convex and non-smooth matrix factorization and beyond. arXiv preprint arXiv:1702.08627.
• [Xu, Lin, and Zha2016] Xu, C.; Lin, Z.; and Zha, H. 2016. Relaxed majorization-minimization for non-smooth and non-convex optimization. In AAAI, 812–818.
• [Zhao, Mathieu, and LeCun2016] Zhao, J.; Mathieu, M.; and LeCun, Y. 2016. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126.
• [Zoran and Weiss2011] Zoran, D., and Weiss, Y. 2011. From learning models of natural image patches to whole image restoration. In ICCV, 479–486.
• [Zuo et al.2013] Zuo, W.; Meng, D.; Zhang, L.; Feng, X.; and Zhang, D. 2013. A generalized iterated shrinkage algorithm for non-convex sparse coding. In ICCV, 217–224.