Image restoration is the process of reconstructing a clean image from a degraded observation. The observed data is assumed to be related to the ideal image through a forward imaging model that accounts for noise, blurring, and sampling. However, a model based only on the observed data is insufficient for effective restoration, and thus an a priori constraint on the solution is commonly used. To this end, image restoration is usually formulated as an energy minimization problem with an explicit regularization function (or regularizer). Recent work on joint restoration leverages a guidance signal, captured from a different device, as an additional cue to regularize the restoration process. These approaches have been successfully applied to various applications, including joint upsampling, cross-field noise reduction, dehazing, and intrinsic image decomposition.
Regularization-based image restoration involves the minimization of non-convex and non-smooth energy functionals to yield high-quality restored results. Solving such functionals typically requires a large number of iterations, so an efficient optimization is preferable, especially in applications where runtime is crucial. One of the most popular optimization methods is the alternating minimization (AM) algorithm, which introduces auxiliary variables. The energy functional is decomposed into a series of subproblems that are relatively simple to optimize, and the minimum with respect to each of the variables is then computed. For image restoration, the AM algorithm has been widely adopted with various regularization functions, e.g., the total variation, the L0 norm, and the Lp norm (hyper-Laplacian). It is worth noting that these functions are all handcrafted models. The hyper-Laplacian of image gradients reflects the statistical properties of natural images relatively well, but the restoration quality of gradient-based regularization methods using such handcrafted models is far from that of the state-of-the-art approaches [30, 9]. In general, it is non-trivial to design an optimal regularization function for a specific image restoration problem.
Over the past few years, several attempts have been made to overcome the limitations of handcrafted regularizers by learning the image restoration model from large-scale training data [30, 9, 39]. In this work, we propose a novel method for image restoration, called deeply aggregated alternating minimization (DeepAM), that effectively uses a data-driven approach within the energy minimization framework. Contrary to existing data-driven approaches that simply produce restoration results with convolutional neural networks (CNNs), we design the CNNs to implicitly learn the regularizer of the AM algorithm. Since the CNNs are fully integrated into the AM procedure, the whole network can be learned simultaneously in an end-to-end manner. We show that our simple model learned from the deep aggregation achieves better results than the recent data-driven approaches [30, 9, 17] as well as the state-of-the-art nonlocal-based methods [10, 12].
Our main contributions can be summarized as follows:
We design the CNNs to learn the regularizer of the AM algorithm and train the whole network in an end-to-end manner.
We introduce the aggregated (or multivariate) mapping in the AM algorithm, which leads to a better restoration model than the conventional point-wise proximal mapping.
We extend the proposed method to joint restoration tasks. It has broad applicability to a variety of restoration problems, including image denoising, RGB/NIR restoration, and depth super-resolution.
2 Related Work
Regularization-based image restoration
Here, we provide a brief review of regularization-based image restoration. The total variation (TV) has been widely used in several restoration problems thanks to its convexity and edge-preserving capability. Other regularization functions, such as the total generalized variation (TGV) and the L0 norm, have also been employed to penalize an image that does not exhibit desired properties. Beyond these handcrafted models, several approaches have attempted to learn the regularization model from training data [30, 9]. Schmidt et al. proposed a cascade of shrinkage fields (CSF) using learned Gaussian RBF kernels. A nonlinear diffusion-reaction process has also been modeled using parameterized linear filters and regularization functions. Joint restoration methods using a guidance image captured under different configurations have also been studied [3, 11, 31, 17]. An RGB image captured in dim light has been restored using flash and non-flash pairs of the same scene. In [11, 15], RGB images were used to assist the regularization process of a low-resolution depth map. Shen et al. proposed to use dark-flashed NIR images for the restoration of noisy RGB images. Li et al. used CNNs to selectively transfer salient structures that are consistent in both guidance and target images.
Use of energy minimization models in deep network
CNNs lack a mechanism for imposing regularity constraints on adjacent, similar pixels, often resulting in poor boundary localization and spurious regions. To deal with these issues, the integration of energy minimization models into CNNs has received great attention [24, 38, 25, 26]. Ranftl et al. defined the unary and pairwise terms of Markov random fields (MRFs) using the outputs of CNNs, and trained the network parameters using bilevel optimization. Similarly, the mean-field approximation for fully connected conditional random fields (CRFs) has been modeled as a recurrent neural network (RNN). A nonlocal Huber regularization has been combined with CNNs for high-quality depth restoration. Riegler et al. integrated an anisotropic TGV into the top of a deep network; they also formulated a bilevel optimization problem and trained the network in an end-to-end manner by unrolling the TGV minimization. Note that the bilevel optimization problem is solvable only when the energy minimization model is convex and twice differentiable. The aforementioned methods integrate handcrafted regularization models on top of CNNs. In contrast, we design the CNNs to parameterize the regularization process within the AM algorithm itself.
3 Background and Motivation
The regularization-based image reconstruction is a powerful framework for solving a variety of inverse problems in computational imaging. The method typically involves formulating a data term for the degraded observation and a regularization term for the image to be reconstructed. An output image is then computed by minimizing an objective function that balances these two terms. Given an observed image and a balancing parameter, we solve the corresponding optimization problem. (For super-resolution, we treat the input as the bilinearly upsampled image from the low-resolution input.)
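The objective itself did not survive extraction. A standard form consistent with the surrounding description, with assumed symbols u for the output image, f for the observation, λ for the balancing parameter, and Φ for the regularizer, would be:

```latex
\min_{u}\;\tfrac{1}{2}\,\|u - f\|_2^2 \;+\; \lambda\,\Phi(\nabla u)
```

The quadratic data term keeps u close to the observation, while Φ penalizes statistically unlikely gradients.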
Here, the gradient operator denotes a discrete implementation of the x- and y-derivatives of the image, and the regularization function enforces the output image to meet desired statistical properties. The unconstrained optimization problem (1) can be solved using numerous standard algorithms. In this paper, we focus on the additive form of the alternating minimization (AM) method, which is well suited to a variety of problems of the form (1).
3.1 Alternating Minimization
The idea of the AM method is to decouple the data and regularization terms by introducing an auxiliary variable, reformulating (1) as a constrained optimization problem in which the auxiliary variable is tied to the image gradient.
The constraint is relaxed with a quadratic penalty term whose weight is the penalty parameter. The AM algorithm then consists of repeatedly performing two steps, one for each variable, until convergence.
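The constrained problem and the two alternating steps were also lost in extraction; the standard additive half-quadratic splitting, with assumed symbols v for the auxiliary variable and β for the penalty parameter, reads:

```latex
\min_{u,v}\;\tfrac{1}{2}\,\|u - f\|_2^2 + \lambda\,\Phi(v) + \tfrac{\beta}{2}\,\|v - \nabla u\|_2^2,
\qquad
\begin{aligned}
v^{t+1} &= \arg\min_{v}\;\lambda\,\Phi(v) + \tfrac{\beta}{2}\,\|v - \nabla u^{t}\|_2^2,\\
u^{t+1} &= \arg\min_{u}\;\tfrac{1}{2}\,\|u - f\|_2^2 + \tfrac{\beta}{2}\,\|v^{t+1} - \nabla u\|_2^2.
\end{aligned}
```

The v-step is the (point-wise) proximal mapping of the regularizer, and the u-step is a simple quadratic problem.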
Minimizing the first step in (4) varies depending on the choice of the regularization function and the penalty parameter. This step can be regarded as the proximal mapping of the regularizer. When the regularizer is a sum of L1 or L0 norms, it amounts to the soft- or hard-thresholding operator, respectively (see Fig. 1 for various examples of this relation). Such mapping operators may not unveil the full potential of the optimization method of (4), since the regularizer and the penalty parameter are chosen manually. Furthermore, the mapping operator is performed for each pixel individually, disregarding the spatial correlation with neighboring pixels.
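For concreteness, the two classical point-wise operators mentioned above can be sketched as follows (a minimal illustration; the threshold `tau` stands in for the manually chosen ratio of regularization weight to penalty parameter):

```python
import numpy as np

# Point-wise proximal mappings applied to gradient coefficients g.
def soft_threshold(g, tau):
    # Proximal operator of the l1 norm: shrink every coefficient toward zero.
    return np.sign(g) * np.maximum(np.abs(g) - tau, 0.0)

def hard_threshold(g, tau):
    # Thresholding associated with the l0 penalty: keep a coefficient or kill it.
    return np.where(np.abs(g) > tau, g, 0.0)

g = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
print(soft_threshold(g, 0.5))
print(hard_threshold(g, 0.5))
```

Both operators act on each coefficient independently, which is exactly the point-wise behavior the proposed aggregated mapping is designed to overcome.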
Building on this observation, we propose a new approach in which the regularization function and the penalty parameter are learned from a large-scale training dataset. Different from the point-wise proximal mapping based on a handcrafted regularizer, the proposed method learns and aggregates the mapping through CNNs.
4 Proposed Method
In this section, we first introduce the DeepAM for single-image restoration and then extend it to joint restoration tasks. In the following, subscripts denote the location of a pixel (in vector form).
4.1 Deeply Aggregated AM
We begin with some intuition about why our learned and aggregated mapping is crucial to the AM algorithm. The first step in (4) maps gradients with a small magnitude to zero, since they are assumed to be caused by noise rather than the original signal. Traditionally, this mapping step has been applied in a point-wise manner, whether learned or not. Schmidt et al. modeled the point-wise mapping function with Gaussian RBF kernels and learned their mixture coefficients. (For a point-wise regularizer, the first step in (4) is separable with respect to each pixel and can thus be modeled by a point-wise operation.) In contrast, we do not presume any property of the regularizer; we instead train the multivariate mapping process associated with the regularizer and the penalty parameter by making use of CNNs. Figure 2 shows denoising examples of TV, CSF, and our method. Our method outperforms those using point-wise mappings based on a handcrafted model (Fig. 2(b)) or a learned model (Fig. 2(c)) (see the insets).
We reformulate the original AM iterations in (4) with the following steps. (The gradient operator is absorbed into the CNNs.)
where the mapping in the first step is computed by a convolutional network with learnable parameters. Note that the penalty parameter is completely absorbed into the CNNs and fused with the balancing parameter (which will also be learned). The auxiliary variable is estimated by deeply aggregating gradient information through the CNNs. This formulation allows us to turn the optimization procedure in (1) into a cascaded neural network architecture, which can be learned by the standard back-propagation algorithm.
The solution of (6) satisfies the following linear system:
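The system itself was elided in extraction. With a quadratic data term as in (1), and with the scalar penalty replaced by learned, spatially varying weights Λ = diag(λ_p) (symbols assumed), the normal equations of the quadratic u-update take the form:

```latex
\bigl(\mathbf{I} + \nabla^{\!\top}\Lambda\,\nabla\bigr)\,u \;=\; f + \nabla^{\!\top}\Lambda\,v,
\qquad \mathbf{L} \;=\; \nabla^{\!\top}\Lambda\,\nabla .
```

Here L is the weighted Laplacian referred to in the text; for constant weights the system matrix is block Toeplitz, as noted below.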
where the system matrix contains the Laplacian matrix built from the learned weights. It can be seen that (7) naturally imposes spatial and appearance consistency on the intermediate output image using a kernel matrix. The linear system of (7) becomes part of the deep neural network (see Fig. 3). When the weight is a constant, the block Toeplitz matrix is diagonalizable with the fast Fourier transform (FFT). However, in our framework, the direct application of the FFT is not feasible since the weight is spatially varying for adaptive regularization. Fortunately, the matrix is still sparse and positive semi-definite, as the simple gradient operator is used. We adopt the preconditioned conjugate gradient (PCG) method to solve the linear system of (7). The incomplete Cholesky factorization is used for computing the preconditioner.
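A small 1-D analogue of this solver can be sketched with SciPy; the difference operator `D`, the per-pixel weights `lam`, and the incomplete-LU preconditioner (standing in for incomplete Cholesky on this SPD system) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Solve (I + D^T Λ D) u = b with preconditioned conjugate gradient.
rng = np.random.default_rng(0)
n = 200
D = sp.diags([-np.ones(n - 1), np.ones(n - 1)], [0, 1], shape=(n - 1, n))
lam = sp.diags(0.5 + rng.random(n - 1))      # spatially varying weights (hypothetical)
A = (sp.eye(n) + D.T @ lam @ D).tocsc()      # sparse, symmetric positive definite
b = rng.standard_normal(n)

# Incomplete factorization as the preconditioner.
ilu = spla.spilu(A, drop_tol=1e-3)
M = spla.LinearOperator(A.shape, ilu.solve)

x, info = spla.cg(A, b, M=M, maxiter=50)
print(info)  # 0 on convergence
```

The preconditioner is what makes a handful of CG iterations sufficient, which is the property exploited for both the forward solve and the back-propagation below.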
Although such aggregation is conceptually similar to ours (aggregation over neighboring pixels is commonly used in state-of-the-art denoising methods), the operators in those methods still rely on handcrafted models. Figure 3 shows the proposed learning model for image restoration tasks. The DeepAM, consisting of a deep aggregation network, a parameter network, a guidance network (detailed in the next section), and a reconstruction layer, is iterated several times, followed by the loss layer.
Figure 4 shows the denoising result of our method, trained with three passes of DeepAM. The input image is corrupted by Gaussian noise. We can see that, as the iteration proceeds, high-quality restoration results are produced. The trained networks in the first and second iterations remove the noise, but the intermediate results are over-smoothed (Figs. 4(a) and (b)). The high-frequency information is then recovered in the last network (Fig. 4(c)). To analyze this behavior, let us look back at the existing soft-thresholding operator. The conventional AM method sets the penalty parameter to a small constant and increases it over the iterations. When the penalty parameter is small, the range of the mapped gradients is shrunk, penalizing large gradient magnitudes; the high-frequency details of an image are recovered as the parameter increases. Interestingly, the DeepAM shows very similar behavior (Figs. 4(d)-(f)), but outperforms the existing methods thanks to the aggregated mapping through the CNNs, as will be validated in the experiments.
4.2 Extension to Joint Restoration
In this section, we extend the proposed method to joint restoration tasks. The basic idea of joint restoration is to provide structural guidance, assuming structural correlation between different kinds of feature maps, e.g., depth/RGB and NIR/RGB. Such a constraint has been imposed on the conventional mapping operator by considering the structures of both input and guidance images. Similarly, one can modify the deeply aggregated mapping of (5) as follows:
where the network input is the concatenation of the intermediate image and a guidance image. However, we find such early concatenation to be less effective, since it mixes heterogeneous data. This coincides with observations in the literature on multispectral pedestrian detection. Instead, we adopt a halfway concatenation similar to [18, 17]: another sub-network is introduced to extract an effective representation of the guidance image, which is then combined with intermediate features of the deep aggregation network (see Fig. 3).
4.3 Learning Deeply Aggregated AM
In this section, we explain the network architecture and the training procedure based on the standard back-propagation algorithm. Our code will be made publicly available.
One iteration of the proposed DeepAM consists of four major parts: a deep aggregation network, a parameter network, a guidance network (for joint restoration), and a reconstruction layer, as shown in Fig. 3. The deep aggregation network consists of 10 convolutional layers, and each hidden layer has 64 feature maps. Since the output gradients contain both positive and negative values, the rectified linear unit (ReLU) is not used for the last layer. The input distributions of all convolutional layers are normalized to the standard Gaussian distribution. The output channel of the deep aggregation network is 2, for the horizontal and vertical gradients. We also extract the spatially varying balancing weight by exploiting features from the eighth convolutional layer of the deep aggregation network; the ReLU is used to ensure that its values are positive.
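The filter sizes did not survive extraction; assuming stride-1 convolutions with hypothetical 3x3 kernels, the receptive field of such a stack grows as 1 plus the sum of (k - 1) over the layers, which a quick check makes concrete:

```python
# Receptive field of a stack of stride-1 convolutions: rf = 1 + sum(k - 1).
def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Hypothetical configuration: ten 3x3 layers.
print(receptive_field([3] * 10))  # 21
```

This is why even a modest stack lets each output gradient aggregate a large spatial neighborhood, in contrast to the point-wise mapping.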
For joint image restoration, the guidance network consists of 3 convolutional layers operating on a local spatial region. It takes the guidance image as input and extracts a feature map, which is then concatenated with the third convolutional layer of the deep aggregation network. There are no parameters to be learned in the reconstruction layer.
The DeepAM is learned via the standard back-propagation algorithm; we do not require the complicated bilevel formulation of [24, 26]. Given training image pairs, we learn the network parameters by minimizing the following loss function:
where the two terms denote the ground-truth image and the output of the last reconstruction layer in (7), respectively. It is known that the L1 loss in deep networks reduces splotchy artifacts and outperforms the L2 loss for pixel-level prediction tasks. We use stochastic gradient descent (SGD) to minimize the loss function of (10). The derivative for the back-propagation is obtained as follows:
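The seed derivative of the loss can be sketched numerically; this assumes the L1 loss suggested by the surrounding discussion, whose (sub)gradient with respect to the prediction is simply the element-wise sign of the residual:

```python
import numpy as np

def l1_loss(pred, gt):
    # Sum of absolute differences between prediction and ground truth.
    return np.abs(pred - gt).sum()

def l1_loss_grad(pred, gt):
    # Subgradient of the l1 loss w.r.t. the prediction: element-wise sign.
    return np.sign(pred - gt)

pred = np.array([1.0, 2.0, 3.0])
gt = np.array([0.5, 3.0, 3.0])
print(l1_loss(pred, gt))
print(l1_loss_grad(pred, gt))
```

This sign vector is what enters the chain rule below as the derivative of the loss with respect to the reconstructed image.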
To learn the parameters in the network, we need the derivatives of the loss with respect to the output image and the auxiliary variable. By the chain rule of differentiation, the former derivative can be derived from (7), and it is obtained by solving the linear system of (12). Similarly, for the auxiliary variable we have:
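Equations (12) and (13) were elided in extraction. Writing the reconstruction step as A u = b with A = I + ∇ᵀΛ∇ and b = f + ∇ᵀΛ v (symbols assumed), a reconstruction consistent with the surrounding statements, the shared system matrix and the element-wise product, is:

```latex
A^{\top}\,\frac{\partial \mathcal{L}}{\partial b} \;=\; \frac{\partial \mathcal{L}}{\partial u},
\qquad
\frac{\partial \mathcal{L}}{\partial v} \;=\; \lambda \odot \Bigl(\nabla\,\frac{\partial \mathcal{L}}{\partial b}\Bigr).
```

The first equation is solved with the same matrix A as the forward pass, which is why its incomplete factorization can be reused.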
where "⊙" denotes element-wise multiplication. Since the loss is a scalar value, the two derivatives are N×1 and 2N×1 vectors, respectively, where N is the total number of pixels. More details on the derivations of (12) and (13) are available in the supplementary material. The system matrix is shared by (12) and (13), so its incomplete factorization is performed only once.
We find that a few PCG iterations are enough for the back-propagation: the average residual on 20 images is 1.3 after 10 iterations. The table in Fig. 5 compares the runtime of the PCG iterations and the MATLAB backslash operator (on a 256×256 image). The PCG with 10 iterations is about 5 times faster than the direct linear-system solver.
5 Experiments
We jointly train our DeepAM for 20 epochs. From here on, we refer to the model trained through a cascade of DeepAM iterations simply as DeepAM. The MatConvNet library (with a 12GB NVIDIA Titan GPU) is used for network construction and training. The networks are initialized randomly using Gaussian distributions. The momentum and weight-decay parameters are set to 0.9 and 0.0005, respectively. We do not perform any pre-training (or fine-tuning). The proposed method is applied to single-image denoising, depth super-resolution, and RGB/NIR restoration. The results for the comparison with other methods are obtained from source codes provided by the authors. Additional results and analyses are available in the supplementary material.
|PSNR / SSIM|BM3D |MLP |CSF |TRD |Ours (1 iter.)|Ours (2 iter.)|Ours (3 iter.)|
|15|31.12 / 0.872|-|31.24 / 0.873|31.42 / 0.882|31.40 / 0.882|31.65 / 0.885|31.68 / 0.886|
|25|28.61 / 0.801|28.84 / 0.812|28.73 / 0.803|28.91 / 0.815|28.95 / 0.816|29.18 / 0.824|29.21 / 0.825|
|50|25.65 / 0.686|26.00 / 0.708|-|25.96 / 0.701|25.94 / 0.701|26.20 / 0.714|26.24 / 0.716|
5.1 Single Image Denoising
We learned our model from a set of patches sampled from the BSD300 dataset, with the number of DeepAM iterations set to 3, as the performance converges after 3 iterations (refer to Table 2). The noise levels were set to 15, 25, and 50. We compared against a variety of recent state-of-the-art techniques, including BM3D, WNNM, CSF, TRD, EPLL, and MLP. The first two methods are based on nonlocal regularization, and the others are learning-based approaches.
Table 1 shows the peak signal-to-noise ratio (PSNR) on the 12 test images. The best results for each image are highlighted in bold. Our method yields the highest PSNR results on most images. We find that the deep aggregation used in our mapping step outperforms the point-wise mapping of the CSF by 0.3-0.5 dB. Learning-based methods tend to perform better than handcrafted models. We observed, however, that the methods based on nonlocal regularization (BM3D and WNNM) usually work better on images dominated by repetitive textures, e.g., 'House' and 'Barbara'. The nonlocal self-similarity is a powerful prior on regular and repetitive textures, but it may lead to inferior results in irregular regions.
|BMP (3): NYU v2 / Middlebury, at three upsampling factors|
|NMRF |1.41 / 4.56|4.21 / 7.59|16.25 / 13.22|
|TGV |1.58 / 5.72|5.42 / 8.82|17.89 / 13.47|
|SD filter |1.27 / 2.41|3.56 / 5.97|15.43 / 12.18|
|DJF |0.68 / 3.75|1.92 / 6.37|5.82 / 12.63|
|Ours|0.57 / 3.14|1.58 / 5.78|4.63 / 10.45|
Figure 6 shows denoising results on one image from the BSD68 dataset. Our method visually outperforms state-of-the-art methods. Table 2 summarizes an objective evaluation, measuring the average PSNR and structural similarity index (SSIM) on 68 images from the BSD68 dataset. As expected, our method achieves a significant improvement over the nonlocal-based methods as well as the recent data-driven approaches. Due to the space limit, some methods were omitted from the table; a full performance comparison is available in the supplementary material.
5.2 Depth Super-resolution
Modern depth sensors, e.g., the MS Kinect, provide dense depth measurements of dynamic scenes, but typically at a low resolution. A common approach to this problem is to exploit a high-resolution (HR) RGB image as guidance. We applied our method to this task and evaluated it on the NYU v2 dataset and the Middlebury dataset. The NYU v2 dataset consists of 1449 RGB-D image pairs of indoor scenes, among which 1000 image pairs were used for training and 449 for testing. Depth values are normalized to the range [0, 255]. To train the network, we randomly collected RGB-D patch pairs from the training set. A low-resolution (LR) depth image was synthesized by nearest-neighbor downsampling at several scale factors. The network takes as inputs the LR depth image, bilinearly interpolated to the desired HR grid, and the HR RGB image.
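The LR synthesis and the bilinear pre-interpolation can be sketched as follows; the depth map and the scale factor `s = 4` are illustrative assumptions (the actual factors were elided in extraction):

```python
import numpy as np
from scipy.ndimage import zoom

rng = np.random.default_rng(0)
depth_hr = rng.random((64, 64))        # hypothetical HR depth map, values in [0, 1]
s = 4                                  # assumed scale factor

depth_lr = depth_hr[::s, ::s]          # nearest-neighbor downsampling
depth_in = zoom(depth_lr, s, order=1)  # bilinear interpolation back to the HR grid
print(depth_lr.shape, depth_in.shape)  # (16, 16) (64, 64)
```

The network then refines `depth_in` on the HR grid rather than learning the upsampling from scratch.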
The TGV method uses an anisotropic diffusion tensor that depends solely on the RGB image. The major drawback of this approach is that the RGB-depth coherence assumption is violated on textured surfaces; the restored depth image may thus contain gradients similar to those of the color image, which causes texture-copying artifacts (Fig. 7(d)). Although the NMRF combines several weighting schemes computed from the RGB image, a segmentation, and an initially interpolated depth, texture-copying artifacts are still observed (Fig. 7(c)). The NMRF preserves depth discontinuities well, but shows poor results on smooth surfaces. The DJF avoids the texture-copying artifacts thanks to faithful CNN responses extracted from both the color image and the depth map (Fig. 7(e)). However, this method lacks a regularization constraint that encourages spatial and appearance consistency in the output, and thus it over-smooths the results and does not preserve thin structures. Our method preserves sharp depth discontinuities without notable artifacts, as shown in Fig. 7(f). Quantitative evaluations on the NYU v2 and Middlebury datasets are summarized in Table 3. Accuracy is measured by the bad matching percentage (BMP) with tolerance 3.
5.3 RGB/NIR Restoration
The RGB/NIR restoration aims to enhance a noisy RGB image taken under low illumination using a spatially aligned NIR image. The challenge in applying our model to RGB/NIR restoration is the lack of ground-truth data for training. To construct a large training set, we used the indoor IVRL dataset, consisting of 400 RGB/NIR pairs recorded under daylight illumination. (This dataset was originally introduced for semantic segmentation.) Specifically, we generated noisy RGB images by adding synthetic Gaussian noise at two noise levels, and used 300 image pairs for training.
In Table 4, we perform an objective evaluation using 5 test images. Our method gives better quantitative results than other state-of-the-art methods [31, 10, 13]. Figure 8 compares the RGB/NIR restoration results of Cross-field, DJF, and our method on a real-world example; the input RGB/NIR pair was taken from a project website. This experiment shows that the proposed method can be applied to real-world data, although it was trained on a synthetic dataset. It has been reported that a restoration algorithm designed (or trained) to work under a daylight condition can also be used under both daylight and night conditions.
6 Conclusion
We have explored a general framework, the DeepAM, which can be used in various image restoration applications. Contrary to existing data-driven approaches that simply produce the restoration result with CNNs, the DeepAM uses CNNs to learn the regularizer of the AM algorithm. Our formulation fully integrates the CNNs with an energy minimization model, making it possible to learn the whole network in an end-to-end manner. Experiments demonstrate that the deep aggregation in the mapping step is the critical factor of the proposed learning model. As future work, we will further investigate an adversarial loss for pixel-level prediction tasks.
-  http://faculty.cse.tamu.edu/davis/suitesparse.html/.
-  http://www.vlfeat.org/matconvnet/.
-  A. Agrawal, R. Raskar, S. Nayar, and Y. Li. Removing photography artifacts using gradient projection and flash-exposure sampling. ACM Trans. Graph., 24(3), 2005.
-  K. Bredies, K. Kunisch, and T. Pock. Total generalized variation. SIAM J. Imag. Sci., 3(3), 2010.
-  A. Buades, B. Coll, and J. Morel. A non-local algorithm for image denoising. CVPR, 2005.
-  H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: can plain neural networks compete with BM3D? CVPR, 2012.
-  S. Chan, X. Wang, and O. Elgendy. Plug-and-play ADMM for image restoration: fixed point convergence and applications. arXiv, 2016.
-  Q. Chen and V. Koltun. A simple model for intrinsic image decomposition with depth cues. ICCV, 2013.
-  Y. Chen, W. Yu, and T. Pock. On learning optimized reaction diffusion processes for effective image restoration. CVPR, 2015.
-  K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3d transform-domain collaborative filtering. IEEE Trans. Image Process., 16(8), 2007.
-  D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruther, and H. Bischof. Image guided depth upsampling using anisotropic total generalized variation. ICCV, 2013.
-  S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. CVPR, 2014.
-  B. Ham, M. Cho, and J. Ponce. Robust image filtering using joint static and dynamic guidance. CVPR, 2015.
-  H. Honda, R. Timofte, and L. Van Gool. Make my day - high-fidelity color denoising with near-infrared. CVPRW, 2015.
-  Y. Kim, B. Ham, C. Oh, and K. Sohn. Structure selective depth superresolution for rgb-d cameras. IEEE Trans. Image Process., 25(11), 2016.
-  D. Krishnan and R. Fergus. Fast image deconvolution using hyper-laplacian priors. NIPS, 2009.
-  Y. Li, J. Huang, N. Ahuja, and M. Yang. Deep joint image filtering. ECCV, 2016.
-  J. Liu, S. Zhang, S. Wang, and D. Metaxas. Multispectral deep neural networks for pedestrian detection. BMVC, 2016.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. ICCV, 2001.
-  M. C. Mozer. A focused back-propagation algorithm for temporal pattern recognition. Complex Systems, 3(4), 1989.
-  H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. ICCV, 2015.
-  N. Parikh and S. Boyd. Proximal algorithms. Found. and Trends in optimization, 2014.
-  J. Park, H. Kim, Y. W. Tai, M. S. Brown, and I. Kweon. High quality depth map upsampling for 3d-tof cameras. ICCV, 2011.
-  R. Ranftl and T. Pock. A deep variational model for image segmentation. GCPR, 2014.
-  G. Riegler, D. Ferstl, M. Rüther, and H. Bischof. A deep primal-dual network for guided depth super-resolution. BMVC, 2016.
-  G. Riegler, M. Rüther, and H. Bischof. ATGV-Net: Accurate depth super-resolution. ECCV, 2016.
-  S. Roth and M. J. Black. Fields of experts. IJCV, 82(2), 2009.
-  N. Salamati, D. Larlus, G. Csurka, and S. Susstrunk. Incorporating near-infrared information into semantic image segmentation. arXiv, 2014.
-  D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47(1), 2002.
-  U. Schmidt and S. Roth. Shrinkage fields for effective image restoration. CVPR, 2014.
-  X. Shen, Q. Yan, L. Xu, L. Ma, and J. Jia. Multispectral joint image restoration via optimizing a scale map. IEEE Trans. Pattern Anal. Mach. Intell., 1(1), 2015.
-  X. Shen, C. Zhou, L. Xu, and J. Jia. Mutual-structure for joint filtering. ICCV, 2015.
-  N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. ECCV, 2012.
-  Y. Wang, J. Yang, W. Yin, and Y. Zhang. A new alternating minimization algorithm for total variation image reconstruction. SIAM J. Imag. Sci., 1(3), 2008.
-  Z. Wang, A. C. Bovik, H. Rahim, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process., 13(4), 2004.
-  L. Xu, C. Lu, Y. Xu, and J. Jia. Image smoothing via L0 gradient minimization. ACM Trans. Graph., 30(6), 2011.
-  H. Zhao, O. Gallo, I. Frosio, and J. Kautz. Loss functions for neural networks for image processing. arXiv, 2015.
-  S. Zheng, S. Jayasumana, B. Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. ICCV, 2015.
-  D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. ICCV, 2011.