In this work, we consider the optimization problem
where has blocks (), is continuously differentiable, but a series of is not necessarily differentiable. Notice that convexity is not assumed for , or . By considering as an extended-value function (i.e., take value), we can incorporate the set constraint into , since this is equivalent to minimizing the indicator function of . Therefore, we do not include this set constraint in the following analysis. Typically, captures the data-fitting loss of the specific task model and is the regularization that promotes desired structures on the variable . Problems appearing in many learning and vision applications, such as sparse coding [Lin et al.2011], tensor factorization [Xu and Yin2017], image restoration [Liu et al.2018], and data clustering [Yang et al.2016], can all be (re)formulated in the form of Eq. (1).
1.1 Related Works
One of the most prevalent approaches for solving Eq. (1) is operator splitting. The main idea behind such schemes is to reduce complex problems built from simple pieces into a series of smaller subproblems that can be solved sequentially or in parallel. In the past several decades, a variety of splitting methods have been designed and analyzed. For example, [Passty1979] provided a prototype of the Forward-Backward Splitting (FBS) and proved its ergodic convergence. [Beck and Teboulle2009] presented the convergence rate of proximal gradient (PG) and accelerated proximal gradient (APG, also known as FISTA). Recently, [Davis and Yin2016] provided a unified way to analyze the convergence rate of Peaceman-Rachford Splitting (PRS) and Douglas-Rachford Splitting (DRS). It is also known that the widely used Alternating Direction Method of Multipliers (ADMM) can be reformulated within the operator splitting (e.g., DRS) framework in the dual space [Lin et al.2015]. Despite their mathematically proved convergence properties, these generally designed algorithms may still fail on some particular nonconvex optimization models in real scenarios. This is mainly because their fixed updating schemes make it hard to escape unwanted saddle points during the iterations.
To improve the performance in practical real-world applications, some researchers have tried to parameterize existing iteration schemes and learn their parameters in the resulting propagation models. For example, [Liu et al.2016] learned the parameters of a parameterized partial differential equation for various image and video processing tasks. Similarly, [Chen et al.2015] introduced a higher-order diffusion system to perform data-dependent gradient descent for image denoising and super-resolution. The studies in [Uwe and Stefan2014] and [Yang et al.2017] respectively parameterized half-quadratic splitting and ADMM for practical applications, such as non-blind deconvolution and MRI imaging. Very recently, inspired by the success of deep networks in different application fields, some works have also tried to replace the standard iterations with existing network architectures. By considering convolutional neural networks (CNNs) as special image priors, [Zhang et al.2017] proposed an iterative CNN scheme to address image restoration problems.
However, we have to point out that, although these methods perform relatively better in some specific tasks, they lose the convergence properties that have been proved for the original schemes. That is, neither the adaptive parameterization nor the replaced CNNs mentioned above can preserve the convergence results proved for the original iteration schemes. Moreover, it is not even possible to investigate and control the iterative behaviors (e.g., descent) of these methods, since their learned iterations actually no longer solve the original optimization model.
In this work, we propose Learnable Bregman Splitting (LBS), a novel deep operator splitting algorithm for nonconvex optimization in real-world scenarios. Specifically, we first introduce a Bregman distance function to penalize the variables at each iteration. Then the basic LBS updating scheme is established based on a relaxed Krasnosel'skii-Mann iteration [Davis and Yin2016]. By introducing a novel triple operator splitting strategy, we can successfully combine the task-model-inspired and data-learning-driven operators within the LBS algorithmic framework. In summary, our main contributions are:
- LBS provides a novel learning strategy to extend prevalent mathematically designed operator splitting schemes for task-specific nonconvex optimization. Thanks to the learnable deep architectures, we can learn our iterations on collected training data to avoid unwanted solutions in particular applications.
- Different from most existing learning-based optimization algorithms (e.g., the iteration parameterization and CNN incorporation methods mentioned above), which come with no theoretical guarantee, we investigate the iterative behaviors in detail, prove the global convergence, and estimate the convergence rate of our LBS.
- We also demonstrate how to apply our algorithm to different computer vision applications, and extensive results verify that LBS outperforms state-of-the-art methods on all the compared problems.
2 Learnable Bregman Splitting Method
In this section, a learning-based operator splitting method, named Learnable Bregman Splitting (LBS), is developed for the nonconvex optimization model in Eq. (1).
2.1 Bregman Distance Penalization
As a fundamental proximity measure, the Bregman distance plays important roles in various iterative algorithms (the use of the Bregman distance in optimization is widespread in various contexts, and many interesting properties of this function can be found in the comprehensive work [Bauschke et al.1997]). However, since it satisfies neither the triangle inequality nor symmetry, this function is not a true metric. Given a convex differentiable function , the associated Bregman distance can be written as
Clearly, is strictly convex with respect to the first argument. Moreover, for all and is equal to zero if and only if . So actually provides a natural (asymmetric) proximity measure between points in the domain of .
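For reference, the standard form of the Bregman distance is sketched below, together with the quadratic special case and the lower bound implied by strong convexity. The symbols $h$, $D_h$, $\mathbf{Q}$ and $\mu$ are placeholder names assumed here for illustration and may differ from the notation used in the rest of the paper.

```latex
% Standard definition for a convex differentiable kernel h (notation assumed)
D_h(\mathbf{x}, \mathbf{y}) = h(\mathbf{x}) - h(\mathbf{y})
    - \langle \nabla h(\mathbf{y}), \mathbf{x} - \mathbf{y} \rangle .

% Quadratic kernel h(x) = (1/2) x^T Q x with Q symmetric positive definite:
D_h(\mathbf{x}, \mathbf{y}) = \tfrac{1}{2}
    (\mathbf{x} - \mathbf{y})^{\top} \mathbf{Q} (\mathbf{x} - \mathbf{y}) .

% If h is \mu-strongly convex, the distance dominates a squared norm:
D_h(\mathbf{x}, \mathbf{y}) \ge \tfrac{\mu}{2} \| \mathbf{x} - \mathbf{y} \|^{2} .
```

The quadratic case is symmetric, which is an exception; for a general kernel the two arguments cannot be swapped.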
In this work, we introduce as a penalty term for each at -th iteration. That is, we actually minimize the following energy to update :
where we denote and is the penalty parameter. It will be demonstrated that brings nice convergence properties for the proposed optimization model when it is -strongly convex [Bauschke et al.1997].
2.2 Uniform Coordinate Updating Scheme
In this work, we consider the following general coordinate update scheme to minimize the energy function in Eq. (3):
where denotes the update direction (with respect to the problem) on , is a step size and denotes the -th block of the given variable. It should be pointed out that by formulating (here denotes the identity mapping), Eq. (4) can be further recognized as a relaxed Krasnosel'skii-Mann iteration [Shi et al.2016] with the operator (i.e., ), and then various existing first-order schemes can be reformulated in the form of Eq. (4).
Specifically, by defining (resolvent), (reflection) for (operator about ), we can obtain a variety of prevalent splitting schemes, such as FBS, PRS, and DRS. As for the operator in our work, if setting and , we obtain FBS from Eq. (4), i.e., where denotes the operator composition. By considering and , we further have the well-known proximal (or projected) gradient scheme from FBS (the proximal operation with respect to , denoted as , is defined as ). Setting and , Eq. (4) reduces to which is just the standard PRS iteration. Similarly, with the same in PRS and , we can also deduce DRS.
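To make the relation between the relaxed Krasnosel'skii-Mann iteration and FBS concrete, the following is a minimal Python sketch on a toy problem. The problem (an l1-regularized least-squares fit with identity forward model), the function names, and the parameter values are all assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def soft_threshold(v, tau):
    """Entrywise proximal operator of tau * ||.||_1."""
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]

def fbs_operator(x, b, lam, gamma):
    """Forward-backward operator T = prox_{gamma*g} o (I - gamma * grad f)
    for the toy choice f(x) = 0.5 * ||x - b||^2 and g(x) = lam * ||x||_1."""
    forward = [xi - gamma * (xi - bi) for xi, bi in zip(x, b)]
    return soft_threshold(forward, gamma * lam)

def km_iterate(x0, operator, alpha=0.8, iters=200):
    """Relaxed Krasnosel'skii-Mann iteration: x <- x + alpha * (T(x) - x)."""
    x = list(x0)
    for _ in range(iters):
        tx = operator(x)
        x = [xi + alpha * (ti - xi) for xi, ti in zip(x, tx)]
    return x

# Toy data (hypothetical); the minimizer is soft_threshold(b, lam).
b = [3.0, -0.2, 0.5, -2.0]
lam = 1.0
x_star = km_iterate([0.0] * len(b), lambda x: fbs_operator(x, b, lam, gamma=0.5))
```

With `alpha = 1` the loop is exactly the proximal gradient iteration; values of `alpha` below one average the new point with the previous one, which is the relaxation referred to above.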
Additionally, it should be pointed out that the well-known ADMM [Lin et al.2011] can also be deduced by applying DRS on its Lagrange dual space [Davis and Yin2016]. Therefore, although the original ADMM is designed for linearly constrained models, we can still reformulate it as a special case of Eq. (4) in the dual variable space. Thus Eq. (4) can also be utilized to address constrained problems.
2.3 Splitting with Learnable Architecture
As discussed above, most existing splitting algorithms (e.g., FBS, PRS and DRS) specify the operator based only on the optimization model. However, due to the nonconvex nature of the model, it is hard for these schemes to escape undesired local minima. Moreover, the complex data distribution in real applications will also slow down the redesigned iterations.
To partially address these issues, we provide a new splitting strategy, in which a learnable operator is introduced to extract information from the data. That is, we consider the following triple splitting scheme:
where and are operators related to in Eq. (3). Here we just follow an FBS-like strategy to define and . As for , we would like to build it as a learnable network architecture and train its parameters on a collected training data set (see Sec. 4 for the details of this operator and its training strategy). In this way, we can successfully incorporate data information to improve the iterative performance of the proposed algorithm.
Notice that it is challenging to analyze the convergence of existing network-incorporated iterations (e.g., [Zhang et al.2017]), since all these schemes are built in a heuristic manner. In contrast, we will demonstrate in the following section that the convergence of our LBS can be strictly proved.
2.4 The Complete Algorithm
Since the learnable operator is not deduced from a strict optimization rule, iteration errors may exist when calculating at each stage. Thus we introduce a new condition to control the inexactness of our updating scheme at each iteration. Specifically, we define the optimality error of a given variable at the -th iteration based on the first-order subdifferential of , i.e.,
where (here we denote as the limiting Fréchet subdifferential of [Xu and Yin2017]) and . Then we consider the following so-called Relaxed Optimality Condition (ROC) for the given .
(Relaxed Optimality Condition) Given any , we define the relaxed optimality condition of for () as , where is a fixed positive constant.
Based on the above condition, we are ready to propose our LBS algorithm for solving Eq. (1) in Alg. 1. Notice that the UCUS iteration, denoted as , is stated separately in Alg. 2. It can be seen that if ROC is satisfied, the LBS iterations are fully based on the learnable network operator. For the iterations that do not satisfy ROC, we still perform the model-based operators to guarantee the final convergence. For convenience, hereafter the subvectors and are denoted as and for short, respectively. We also denote , in which .
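The control flow described above (accept the learned update when a check passes, otherwise fall back to a model-based update) can be illustrated with a small Python sketch. This is not Alg. 1: the acceptance test below is a hypothetical surrogate that combines a fixed-point residual bound with a monotonicity check, `learned_step` is a hand-written stand-in for a trained network, and the toy objective and the constant `c` are assumptions made only for this example.

```python
import math

def soft_threshold(v, tau):
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]

def objective(x, b, lam):
    """Toy objective 0.5 * ||x - b||^2 + lam * ||x||_1 (assumed for illustration)."""
    fit = 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
    return fit + lam * sum(abs(xi) for xi in x)

def model_step(x, b, lam, gamma=0.5):
    """Model-based update: one proximal gradient step."""
    forward = [xi - gamma * (xi - bi) for xi, bi in zip(x, b)]
    return soft_threshold(forward, gamma * lam)

def learned_step(x, b):
    """Hypothetical stand-in for a network: pulls x toward the data b
    and ignores the regularizer, so it is fast but not exact."""
    return [xi + 0.9 * (bi - xi) for xi, bi in zip(x, b)]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def safeguarded_solve(b, lam, c=5.0, iters=50):
    x = [0.0] * len(b)
    used_learned = 0
    for _ in range(iters):
        cand = learned_step(x, b)
        # Surrogate optimality error: fixed-point residual of the model operator.
        residual = norm([ci - ti for ci, ti in zip(cand, model_step(cand, b, lam))])
        step = norm([ci - xi for ci, xi in zip(cand, x)])
        ok = residual <= c * step and objective(cand, b, lam) <= objective(x, b, lam)
        if ok:
            x, used_learned = cand, used_learned + 1
        else:
            x = model_step(x, b, lam)  # fall back to the model-based update
    return x, used_learned

x_final, n_learned = safeguarded_solve([3.0, -0.2, 0.5, -2.0], lam=1.0)
```

In this toy setting the stand-in operator is accepted only while it still decreases the objective, and the model-based step then drives the iterate to the minimizer; a well-trained operator would be expected to pass the check far more often.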
3 Convergence Analysis
In this section, we provide a rigorous analysis of the convergence behaviors of LBS. The following assumptions on the functions , , and are necessary for our analysis. Notice that all these assumptions are fairly mild in optimization and are satisfied in most vision and learning problems.
1) is Lipschitz smooth and is proximable (a function is proximable if it is easy to obtain the minimizer of for any given y and ). 2) is coercive.
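As an illustration of what "proximable" means in practice, the sketch below gives the closed-form scalar proximal operators of the l1 norm and of the l0 penalty, and compares them against a brute-force grid search. It assumes the common convention prox(y) = argmin_x tau * g(x) + 0.5 * (x - y)^2; the function names and test values are hypothetical.

```python
import math

def prox_l1(y, tau):
    """Soft thresholding: closed-form prox of tau * |x|."""
    return math.copysign(max(abs(y) - tau, 0.0), y)

def prox_l0(y, tau):
    """Hard thresholding: closed-form prox of tau * [x != 0].
    Keeping y costs tau, zeroing it costs 0.5 * y^2."""
    return y if y * y > 2.0 * tau else 0.0

def brute_force_prox(g, y, tau, lo=-5.0, hi=5.0, n=100001):
    """Grid search for argmin_x tau * g(x) + 0.5 * (x - y)^2 (for checking only)."""
    best_x, best_val = lo, float("inf")
    for i in range(n):
        x = lo + (hi - lo) * i / (n - 1)
        val = tau * g(x) + 0.5 * (x - y) ** 2
        if val < best_val:
            best_x, best_val = x, val
    return best_x

l1 = abs
l0 = lambda x: 0.0 if x == 0 else 1.0

approx_l1 = brute_force_prox(l1, 1.3, 0.5)   # compare with prox_l1(1.3, 0.5)
approx_l0 = brute_force_prox(l0, 0.6, 0.5)   # compare with prox_l0(0.6, 0.5)
```

Both penalties admit a one-line closed form even though the l0 penalty is nonconvex, which is the property the assumption relies on.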
The roadmap of our analysis is summarized as follows. We first prove the non-increase of the objective, the boundedness of the variable sequence, and the convergence of subsequences in Propositions 1, 2, and 3, respectively. Then we prove in Theorem 1 that LBS generates Cauchy sequences, which converge to critical points of the model in Eq. (1). The convergence rate of the sequences is also analyzed in Corollary 1. The detailed proofs are presented in the arXiv report ().
The inequalities in Proposition 1 build the relationship between and , and thus we can obtain a series of useful inequalities:
where . It implies the non-increasing property of .
(Square summable). If , , are the sequences generated by Alg. 1, we have
(Subsequence convergence). Let be the sequence generated by Alg. 1. If is any accumulation point of , then we have
If is a Kurdyka–Łojasiewicz function, we can further prove that is a Cauchy sequence and thus globally converges to a critical point of . (It should be pointed out that many functions arising in learning and vision, including norm and rational norms (i.e., ), are all Kurdyka–Łojasiewicz functions [Lin et al.2015].)
Based on the above theorem, we can estimate convergence rate as follows.
Let be a desingularizing function with a constant and a parameter . Then generated by Alg. 1 converges in finitely many iterations if . The linear and sub-linear rates are obtained when choosing and , respectively.
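For orientation, the classical form of such rate results under the Kurdyka–Łojasiewicz property is sketched below. The parameterization $\varphi(s) = c\,s^{1-\theta}$ and the symbols $c$, $\theta$, $C$, $\tau$ and $\bar{\mathbf{x}}$ are assumptions following the usual convention in the literature; the corollary above may use a different but equivalent parameterization.

```latex
% Assumed desingularizing function: \varphi(s) = c\, s^{1-\theta},
% with c > 0 and \theta \in [0, 1); \bar{x} denotes the limit point.
\begin{aligned}
\theta = 0 &: \quad \mathbf{x}^{k} = \bar{\mathbf{x}} \ \text{after finitely many iterations}, \\
\theta \in (0, \tfrac{1}{2}] &: \quad \|\mathbf{x}^{k} - \bar{\mathbf{x}}\| \le C\, \tau^{k}
    \ \text{for some } C > 0,\ \tau \in [0, 1) \quad \text{(linear rate)}, \\
\theta \in (\tfrac{1}{2}, 1) &: \quad \|\mathbf{x}^{k} - \bar{\mathbf{x}}\| \le C\, k^{-\frac{1-\theta}{2\theta - 1}}
    \quad \text{(sub-linear rate)}.
\end{aligned}
```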
4 Numerical Results
To verify the convergence and performance of LBS for nonconvex optimization, we apply it to two widely studied vision problems, i.e., image completion and deblurring. In our algorithm, we adopt a residual network as the learnable network architecture for , which can describe the sparse priors well. Specifically, our network has 19 layers, comprising 7 convolution layers, 6 ReLU layers, 5 batch normalization layers, and one loss layer. Every convolution layer has 64 kernels of size , and possesses the dilation attribute. In the training stage, we randomly select 800 natural images from the ImageNet database [Deng et al.2009]. The chosen pictures are cropped into small patches of size 35 × 35, and Gaussian noise is added to these patches. As for the Bregman distance , we choose the Mahalanobis distance as in our applications [Bauschke et al.1997]. All experiments are performed on a PC with an Intel Core i7 CPU @ 3.4 GHz, 32 GB RAM, and an NVIDIA GeForce GTX 1050 Ti GPU.
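One layer arrangement that is consistent with the stated counts can be written down as a plain Python specification. The ordering below (a convolution followed by ReLU, then five convolution, batch normalization, ReLU groups, then a final convolution and the loss) is an assumption made for illustration; the text only fixes the number of layers of each type.

```python
from collections import Counter

def build_layer_spec(num_conv=7):
    """Return a list of layer type names for a small residual denoiser.
    The arrangement is hypothetical; only the per-type counts come from the text."""
    layers = ["conv", "relu"]                    # first block: no normalization
    for _ in range(num_conv - 2):                # middle blocks
        layers += ["conv", "batchnorm", "relu"]
    layers += ["conv", "loss"]                   # output convolution and training loss
    return layers

spec = build_layer_spec()
counts = Counter(spec)
```

With the default argument this yields 7 convolution, 6 ReLU, 5 batch normalization and 1 loss layer, for 19 layers in total.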
4.1 -Sparse Coding for Image Completion
We first consider solving a -sparse coding model to address the problem of image completion (also known as image inpainting). The purpose of this task is to restore a visually plausible image in which data are missing due to damage or occlusions. This problem can be formulated as:
where is the observed image, denotes a mask, is the dictionary, is the corresponding vector of sparse coefficients and is a parameter. Following [Beck and Teboulle2009], we consider as an inverse wavelet basis (i.e., multiplying by corresponds to performing the inverse wavelet transform) and thus is just the latent image (denoted as ). To enforce the sparsity of , we set ( is the unit matrix) in the Bregman distance and in the above coding model.
It is easy to check that Eq. (6) is just a specific case of Eq. (1) with a single variable. In the following, we first verify the theoretical results proposed in this work, and then test the performance of LBS on challenging benchmark datasets.
Iteration Behaviors Analysis: We first choose example images from the CBSD68 dataset [Zhang et al.2017] to demonstrate the iterative behaviors of LBS together with some other widely used splitting schemes (e.g., FBS, FISTA, and ADMM). For fair comparison, the stopping criteria of all the compared methods are set in the same manner. That is, we denote and consider as the stopping criterion in all these methods.
Fig. 1 shows the convergence curves from different aspects, including iteration error (“”, defined as ) and reconstruction error (“”, defined as ). LBS is superior to the traditional FBS, FISTA, and ADMM in both convergence speed and final reconstruction quality. LBS reaches the required precision in about a dozen steps, whereas FBS and FISTA need a few hundred steps and ADMM needs about four dozen. Because the network is introduced as , LBS also attains a clearly lower reconstruction error than the other methods. The PSNR and SSIM of the final results also confirm that LBS performs better; concretely, its PSNR is approximately 3 dB higher than that of the compared methods.
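The quantities discussed above can be computed as in the following Python sketch. The relative-norm forms of the iteration and reconstruction errors are assumptions (the usual choices for such curves), and the helper names and the peak value of 255 are hypothetical; PSNR follows its standard definition.

```python
import math

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def relative_change(x_new, x_old):
    """Assumed iteration error: ||x_new - x_old|| / ||x_old||."""
    return norm([a - b for a, b in zip(x_new, x_old)]) / norm(x_old)

def relative_error(x, x_ref):
    """Assumed reconstruction error: ||x - x_ref|| / ||x_ref||."""
    return norm([a - b for a, b in zip(x, x_ref)]) / norm(x_ref)

def psnr(x, x_ref, peak=255.0):
    """Peak signal-to-noise ratio in dB for flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_ref)) / len(x_ref)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

# Toy flattened "images" (hypothetical values).
reference = [100.0, 100.0, 100.0, 100.0]
estimate_a = [102.0, 102.0, 102.0, 102.0]                 # MSE = 4
estimate_b = [100.0 + math.sqrt(2.0)] * 4                  # MSE = 2
gain_db = psnr(estimate_b, reference) - psnr(estimate_a, reference)
```

Since PSNR is logarithmic in the mean squared error, a gain of about 3 dB corresponds to roughly halving the mean squared error, as `gain_db` shows for the two toy estimates.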
We also compare the curves of objective function value errors (“”, based on ) for different settings of LBS, including naive LBS (nLBS, which does not check the ROC and monotone conditions) and the complete LBS in Alg. 1. From the left subfigure of Fig. 2, it is easy to observe that the proposed criteria lead to very fast convergence, while there are severe oscillations on the curves of nLBS. Furthermore, we plot the bars of ROC (i.e., the error and the threshold ) on the right part of Fig. 2. It can be seen that the ROC condition is always satisfied except at the last two iterations. Thus deep networks are performed at most of our iterations. Only at the last stages does LBS tend to perform model-inspired iterations (i.e., Step 8 in Alg. 1) to obtain an accurate solution for the given optimization model.
Comparisons on Benchmarks: To further demonstrate the superiority of LBS, we generate random masks of different levels (including 20%, 40%, 60% and 80% missing pixels) on the CBSD68 dataset [Zhang et al.2017], which contains 68 images of size 481 × 321. Then we compare LBS with four state-of-the-art methods, namely, FoE [Roth and Black2009], VNL [Arias et al.2011], ISDSB [He and Wang2014], and JSM [Zhang et al.2014]. Tab. 1 reports the averaged quantitative results, including PSNR, SSIM, and time (in seconds). It can be seen that, regardless of the proportion of masked pixels, LBS achieves better performance than the state-of-the-art approaches. We attribute this mainly to the use of the learnable network operator.
We then compare the visual performance of LBS with all these methods. Fig. 3 presents the comparisons on an image from the ImageNet database [Deng et al.2009] with 60% missing pixels. It can be seen that LBS outperforms all the compared methods in both visual quality and metrics (PSNR and SSIM). The edges of the motorcycle wheels are restored more smoothly and clearly by LBS, while the results of the other approaches retain some noise and mask residue that degrade the visual quality.
[Fig. 3: image completion results with 60% missing pixels; (PSNR / SSIM) per panel: (24.45 / 0.86), (24.92 / 0.86), (23.27 / 0.83), (25.40 / 0.87), (26.11 / 0.88).]
4.2 Nonconvex TV for Image Deblurring
We further evaluate LBS on image deblurring, which is a challenging problem in computer vision. Here we consider the following widely used total variation (TV) based formulation:
where denote the blur kernel, latent image, and blurry observation, respectively. is the nonconvex TV regularization with gradient matrices and (here we also set for the norm). is the indicator function of the set . Following the half-quadratic splitting technique, Eq. (7) (with auxiliary variables and ) can be reformulated as
[Fig. 5: image deblurring results; (PSNR / SSIM) per panel: (22.78 / 0.75), (25.00 / 0.77), (23.79 / 0.78), (24.39 / 0.81), (25.34 / 0.83).]
Fig. 4 demonstrates the convergence behaviors of LBS on . It can be seen from the left subfigure that “” of all blocks quickly decreases to ; notice that “” of is even less than . On the right subfigure, the “” of and also declines dramatically; these curves are plotted against the right vertical axis. Due to the different range of values, we have to plot the curves of against the left vertical axis. We can see that it still obtains the least “”.
We then report results on the challenging image deblurring benchmark dataset collected by Sun et al. [Sun et al.2013] (which includes 640 blurry images with 1% Gaussian noise) for quantitative evaluation. We compare LBS with a number of competitive approaches, including TV [Wang et al.2008], HL [Krishnan and Fergus2009], CSF [Uwe and Stefan2014], IDDBM3D [Danielyan et al.2012], EPLL [Zoran and Weiss2011], RTF [Schmidt et al.2016], MLP [Schuler et al.2013], IRCNN [Zhang et al.2017], FDN [Kruse et al.2017], and PADNet [Liu et al.2018].
It is known that learning-based methods (e.g., CSF, RTF, MLP, IRCNN, FDN, and PADNet) can achieve better performance than other conventional approaches in terms of quantitative metrics (e.g., PSNR and SSIM). However, they still perform worse than LBS (see Tab. 2), which we attribute to their lack of theoretical guarantees. Fig. 5 shows the qualitative results of LBS against other methods (top 4 in Tab. 2) on an example blurry image, which is generated with a large-scale blur kernel (75 × 75 pixels) on an image from ImageNet [Deng et al.2009]. It can be seen that LBS restores the text and windows more distinctly than the others. Although IRCNN has a relatively higher PSNR than the other compared methods (but lower than LBS), its visual quality and SSIM are unsatisfactory.
This paper proposed Learnable Bregman Splitting (LBS), a novel operator splitting algorithm based on deep architectures for task-specific nonconvex optimization. It is demonstrated that both the model-based operators and the data-dependent networks can be used in our iteration. We also provided solid theoretical analysis to guarantee the convergence of LBS. The experimental results verified that LBS obtains better performance than most other state-of-the-art approaches.
This work is partially supported by the National Natural Science Foundation of China (Nos. 61672125, 61733002, 61572096, 61432003 and 61632019), and the Fundamental Research Funds for the Central Universities.
- [Arias et al.2011] Pablo Arias, Gabriele Facciolo, Vicent Caselles, and Guillermo Sapiro. A variational framework for exemplar-based image inpainting. IJCV, 93(3):319–347, 2011.
- [Bauschke et al.1997] Heinz H Bauschke, Jonathan M Borwein, et al. Legendre functions and the method of random bregman projections. Journal of Convex Analysis, 4(1):27–67, 1997.
- [Beck and Teboulle2009] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
- [Chen et al.2015] Yunjin Chen, Wei Yu, and Thomas Pock. On learning optimized reaction diffusion processes for effective image restoration. In CVPR, 2015.
- [Danielyan et al.2012] Aram Danielyan, Vladimir Katkovnik, and Karen Egiazarian. Bm3d frames and variational image deblurring. IEEE TIP, 21(4):1715–1728, 2012.
- [Davis and Yin2016] Damek Davis and Wotao Yin. Convergence rate analysis of several splitting schemes. In Splitting Methods in Communication, Imaging, Science, and Engineering, pages 115–163. 2016.
- [Deng et al.2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
- [He and Wang2014] Liangtian He and Yilun Wang. Iterative support detection-based split bregman method for wavelet frame-based image inpainting. IEEE TIP, 23(12):5470–5485, 2014.
- [Krishnan and Fergus2009] Dilip Krishnan and Rob Fergus. Fast image deconvolution using hyper-laplacian priors. In NIPS, pages 1033–1041, 2009.
- [Kruse et al.2017] Jakob Kruse, Carsten Rother, and Uwe Schmidt. Learning to push the limits of efficient fft-based image deconvolution. In ICCV, pages 4596–4604, 2017.
- [Lin et al.2011] Zhouchen Lin, Risheng Liu, and Zhixun Su. Linearized alternating direction method with adaptive penalty for low-rank representation. In NIPS, 2011.
- [Lin et al.2015] Zhouchen Lin, Risheng Liu, and Huan Li. Linearized alternating direction method with parallel splitting and adaptive penalty for separable convex programs in machine learning. Machine Learning, 99(2):287, 2015.
- [Liu et al.2016] Risheng Liu, Guangyu Zhong, Junjie Cao, Zhouchen Lin, Shiguang Shan, and Zhongxuan Luo. Learning to diffuse: A new perspective to design pdes for visual analysis. IEEE TPAMI, 38(12):2457–2471, 2016.
- [Liu et al.2018] Risheng Liu, Xin Fan, Shichao Cheng, Xiangyu Wang, and Zhongxuan Luo. Proximal alternating direction network: A globally converged deep unrolling framework. In AAAI, 2018.
- [Passty1979] Gregory B Passty. Ergodic convergence to a zero of the sum of monotone operators in hilbert space. Journal of Mathematical Analysis and Applications, 72(2):383–390, 1979.
- [Roth and Black2009] Stefan Roth and Michael J. Black. Fields of experts. IJCV, 82(2):205–229, 2009.
- [Schmidt et al.2016] Uwe Schmidt, Jeremy Jancsary, Sebastian Nowozin, Stefan Roth, and Carsten Rother. Cascades of regression tree fields for image restoration. IEEE TPAMI, 38(4):677–689, 2016.
- [Schuler et al.2013] Christian J Schuler, Harold Christopher Burger, Stefan Harmeling, and Bernhard Scholkopf. A machine learning approach for non-blind image deconvolution. In CVPR, pages 1067–1074, 2013.
- [Shi et al.2016] Hao-Jun Michael Shi, Shenyinying Tu, Yangyang Xu, and Wotao Yin. A primer on coordinate descent algorithms. arXiv preprint arXiv:1610.00040, 2016.
- [Sun et al.2013] Libin Sun, Sunghyun Cho, Jue Wang, and James Hays. Edge-based blur kernel estimation using patch priors. In ICCP, 2013.
- [Uwe and Stefan2014] Uwe Schmidt and Stefan Roth. Shrinkage fields for effective image restoration. In CVPR, pages 2774–2781, 2014.
- [Wang et al.2008] Yilun Wang, Junfeng Yang, Wotao Yin, and Yin Zhang. A new alternating minimization algorithm for total variation image reconstruction. SIAM Journal on Imaging Sciences, 1(3):248–272, 2008.
- [Xu and Yin2017] Yangyang Xu and Wotao Yin. A globally convergent algorithm for nonconvex optimization based on block coordinate update. Journal of Scientific Computing, pages 1–35, 2017.
- [Yang et al.2016] Yingzhen Yang, Jiashi Feng, Nebojsa Jojic, Jianchao Yang, and Thomas S Huang. $\ell^0$-sparse subspace clustering. In ECCV, pages 731–747, 2016.
- [Yang et al.2017] Yan Yang, Jian Sun, Huibin Li, and Zongben Xu. Admm-net: A deep learning approach for compressive sensing mri. In NIPS, 2017.
- [Zhang et al.2014] Jian Zhang, Debin Zhao, Ruiqin Xiong, Siwei Ma, and Wen Gao. Image restoration using joint statistical modeling in a space-transform domain. IEEE TCSVT, 24(6):915–928, 2014.
- [Zhang et al.2017] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep cnn denoiser prior for image restoration. In CVPR, 2017.
- [Zoran and Weiss2011] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In ICCV, pages 479–486, 2011.