1 Introduction
With the emergence of mobile devices, the amount of user-captured and shared images and videos is rapidly increasing. Without proper file-size reduction, storing and transmitting such data would require huge storage space and wide bandwidth. Image and video compression techniques are designed to reduce file size while preserving the visual quality of the frames. JPEG [1], MPEG and H.26x [22, 20] are classic and widely used standards, which employ the block Discrete Cosine Transform (DCT), thanks to its good energy compaction and decorrelation properties, to achieve compression. However, an inevitable problem of these standards is that, as the compression ratio increases, the fidelity of coded images degrades, i.e. details are ruined and artificial block boundaries appear. The compression artifacts are perceptually annoying and, more importantly, very likely to degrade the performance of many computer vision algorithms that are primarily designed for uncompressed images or videos, such as image enhancement [27, 19, 12, 25], feature description [26, 18], over-segmentation [10, 2, 17] and super-resolution [29, 7]. Hence, techniques for removing or reducing these artifacts are desirable.

Post-processing approaches, which handle compressed frames at the decoder end without changing the mature structure of existing codecs, are attractive for their compatibility with existing codecs. Mathematically, the compressed image/video sequence can be modeled as a linear combination of two components, an intrinsic layer and an artifact layer (e.g. Fig. 1). Over the last decades, significant research effort has gone into the development of post-processing deblocking techniques, which can be broadly categorized into two groups, namely denoising-style and restoration-style deblockers.
The denoising-style deblockers attempt to suppress the effect of the artifact layer by (adaptive) local filters. The earliest work, proposed by Lim and Reeve [15], employs a low-pass filter on block boundaries, which may also blur intrinsic edges of the image. To address this problem, techniques that adaptively perform filtering on regions obtained by either classification or detection have been proposed [21, 8]. The recent video coding standard H.264/AVC [20] analyzes artifacts and chooses different filters for different block boundaries according to their local properties. WNNM [11] and (V)BM3D [6] share the goal of reducing artifacts, although they were originally designed for denoising by exploiting repetitive patterns in the target images or videos. These filtering methods treat the artifacts as noise to be smoothed for visual improvement.
In general, however, these deblockers aim at heuristically smoothing visible artifacts without an objective criterion, rather than genuinely restoring the original information.
Alternatively, the restoration-style methods focus on recovering the intrinsic layer under some assumptions. Various priors have been exploited [9, 4, 23, 3]. Jung et al. attempt to reconstruct the intrinsic layer via sparse representation, which, however, requires the compression ratio to be known and the dictionary to be well learned [13]. Similarly to [13], Choi et al. [5] propose a learning-based approach to reduce JPEG artifacts so as to provide more accurate results in image matting. More recently, Sun and Liu [24] introduced a non-causal temporal prior for video deblocking, which iteratively refines the target frames and the estimation of motion across them. Due to the iterative procedure and the optical flow estimation, the computational load of this approach is very heavy, which limits its applicability. Li et al. [14] developed a four-step method comprising structure-texture decomposition, scene detail extraction, block artifact reduction and layer recomposition. This approach produces promising results when the whole image, or a large part of it, has poor texture; in other words, the block artifacts in poorly textured regions are well suppressed. Otherwise, its performance degrades sharply. The results recovered by restoration-style methods are usually of better quality than those of denoising-style ones, but these methods are either time-consuming and complex (hard to apply to real-world tasks) or case-dependent (lacking generality).

As can be seen from the aforementioned methods, the characteristics of the two layers have been well investigated individually; the relationship between the two layers, however, has rarely been studied. In this paper, we show how to decompose the intrinsic and artifact layers of an image or a video sequence by exploiting strong structural priors on both layers. The main contributions of this paper can be summarized as follows:

We propose an effective one-step visual data deblocking method, DSLP, that harnesses two structural layer priors in a unified fashion: 1) the independence between the gradient fields of the two layers, and 2) the sparsity of the gradient field of the intrinsic layer.

We design a novel Augmented Lagrange Multiplier based algorithm to efficiently and effectively solve the associated optimization problem. Extensive experiments are conducted to demonstrate the efficacy of the proposed algorithm and its superior performance over state-of-the-art alternatives.
2 Deblocking using Structural Layer Priors
2.1 Notations
We first introduce the notations used in this paper. Lowercase letters denote scalars, bold lowercase letters vectors, and bold uppercase letters matrices. The identity matrix and the matrix of all ones, with compatible dimensions, are denoted as usual. The vectorization operation converts a matrix into a vector. Bold calligraphic uppercase letters represent higher-order tensors.
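As a concrete illustration of the tensor operations introduced in this section (mode fibers, unfolding and folding), the following NumPy sketch may help; the function names and the particular fiber ordering are our own illustrative choices, not notation from the paper:

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: arrange the mode-n fibers as columns of a matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of unfold: reshape the matrix back into a tensor of `shape`."""
    full_shape = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(full_shape), 0, mode)

X = np.arange(24).reshape(2, 3, 4)          # a toy 3rd-order tensor
for mode in range(3):
    M = unfold(X, mode)
    assert M.shape == (X.shape[mode], X.size // X.shape[mode])
    assert np.array_equal(fold(M, mode, X.shape), X)  # folding inverts unfolding
```

Note that Frobenius norms and inner products are preserved under any such reordering, since unfolding only permutes the entries.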
A mode fiber of a tensor is the higher-order analogue of a matrix row or column, obtained by fixing all indices but one. The Frobenius and ℓ1 norms of a tensor are defined in the standard entry-wise way, while the ℓ0 norm counts its nonzero elements. The inner product of two tensors of identical size is the sum of the products of corresponding entries. The non-uniform shrinkage operator applies soft-thresholding element-wise, with a possibly different threshold per element, and the Hadamard product denotes the element-wise product of two tensors of the same size. The mode unfolding converts a tensor into a matrix whose columns are the corresponding mode fibers; the mode folding is its inverse, reshaping the matrix back into a tensor, and the reshaping operator restores the original tensor. It is clear that norms and inner products are preserved under vectorization, unfolding and folding.

2.2 Problem Formulation
For generality, we employ tensors as the information container. For instance, a gray image is a 2nd-order tensor, a color image a 3rd-order tensor, and a color video a 4th-order tensor. Recall that the compressed image or video sequence is the superposition of the intrinsic and artifact components. From this model, however, we can see that the number of unknowns to be recovered is twice that of the given measurements, which indicates that the problem is highly ill-posed. Therefore, without additional knowledge, the decomposition problem is intractable, as it has infinitely many solutions and it is thus impossible to identify which of these candidates is indeed the “correct” one. To make the problem well-posed, we impose additional structural layer priors on the desired solution. Before detailing the structural layer priors and the formulation of the problem, we first define the tensor mode derivative response and the generalized tensor gradient.
Definition 1.
(Tensor Mode Derivative Response.) The derivative response of a tensor along its mode fibers is defined as the convolution of those fibers with the derivative filter.
Definition 2.
(Generalized Tensor Gradient.) The generalized gradient of a tensor is defined as the collection of its derivative responses along the modes of interest, which is analogous to the definition of the matrix gradient.
Note that, for an image (whose three modes are width, height and color channel) and a video sequence (with a fourth mode indexing frames), the derivative response across color channels typically carries no statistical meaning and is therefore omitted in the rest of the paper. Furthermore, for clarity, we distinguish the spatial response operators in the vertical and horizontal directions from the temporal response operator. Consequently, the gradient of an image comprises the two spatial responses, while the gradient of a video additionally includes the temporal response.
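To make the response operators concrete, here is a small NumPy sketch of forward-difference derivative responses along chosen modes; circular boundary handling and the filter choice are our illustrative assumptions:

```python
import numpy as np

def diff_along(x, axis):
    """Forward difference along one mode with circular boundary,
    i.e. convolution of the mode fibers with a [-1, 1] filter."""
    return np.roll(x, -1, axis=axis) - x

def tensor_gradient(x, axes):
    """Collect the derivative responses along the requested modes."""
    return [diff_along(x, a) for a in axes]

img = np.random.rand(4, 5)              # a toy gray "image"
gv, gh = tensor_gradient(img, (0, 1))   # vertical and horizontal responses
# under circular boundaries, each response sums to zero along its own axis
assert np.allclose(gv.sum(axis=0), 0)
assert np.allclose(gh.sum(axis=1), 0)
```

For a video, one would simply pass the frame mode as an additional axis to obtain the temporal response.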
Structural layer priors for the problem.
It is well known that natural images and videos are largely piecewise smooth in both the spatial and temporal dimensions, so the gradient field of the intrinsic component is typically sparse. We call this the gradient sparsity prior. In addition, the gradient fields of the two layers should be statistically (approximately) uncorrelated; we refer to this as the gradient independence prior. Furthermore, we observe that the artifact's contribution to pixel values is usually much smaller than the intrinsic layer's.
Based on the priors and the observation stated above, the desired decomposition should minimize the following objective:
(1)  
where the three weights control the relative importance of the different terms and can be computed beforehand; the gradient operator comprises the spatial responses for images and additionally the temporal response for videos. In the objective function (1), the first term restricts the artifact layer to be light, treating it as Gaussian noise. The second term enforces a sparse gradient field on the recovered intrinsic layer. The remaining two terms constrain the gradient fields of the two layers to be independent of each other: the third term penalizes the overlap of the two gradient fields, while the fourth enforces that gradients absent from the observation are not groundlessly generated in either layer, and that existing gradients are not gratuitously erased.
The formulation of the problem (1) can be further simplified according to the following theorem.
Theorem 1.
Given a tensor, there exists a functional matrix whose product with the vectorized tensor equals the vectorized mode derivative response, for any mode.
Proof.
It is well known that the derivative response can alternatively be computed as a matrix-vector product, where the matrix has the same functional behavior as the corresponding derivative filter. Similarly, there is a permutation matrix that transforms the vectorization of a tensor into the vectorization of its mode unfolding. Composing the two and using the properties of permutation matrices yields the desired matrix. ∎
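Theorem 1 can be checked numerically in the 1-D case: circular convolution with a derivative filter coincides with multiplication by a circulant difference matrix. A small sketch, with variable names of our own choosing:

```python
import numpy as np

n = 6
x = np.random.rand(n)
# circulant forward-difference matrix: row i holds -1 at column i, +1 at i+1
Dv = np.roll(np.eye(n), 1, axis=1) - np.eye(n)
# the same response via circular convolution with the filter d (d*x = x_{i+1}-x_i)
d = np.zeros(n); d[0], d[-1] = -1.0, 1.0
conv = np.real(np.fft.ifft(np.fft.fft(d) * np.fft.fft(x)))
assert np.allclose(Dv @ x, conv)   # matrix form == filtering form
```

The circulant structure of this matrix is also what makes the FFT-based solver of Section 2.3 possible.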
2.3 Optimization
As can be seen from the objective function (2), all the aforementioned priors and observations are taken into account in a unified optimization framework for recovering the two layers. However, the objective is difficult to optimize directly due to the non-convexity of the ℓ0 terms. Convex relaxation of these terms is an effective way to make the problem tractable. Hence, we replace the ℓ0 norm with its tightest convex surrogate, namely the ℓ1 norm. The optimization problem can be rewritten as:
(3)  
The Augmented Lagrange Multiplier (ALM) method with the Alternating Direction Minimizing (ADM) strategy [16] has proven to be an efficient and effective solver for problems like (3). To apply ALM-ADM to our problem, we need to make the objective function separable. We therefore introduce two auxiliary variables to replace the two gradient terms in the objective function (3), with the corresponding equalities acting as additional constraints. The formulation (3) can then be modified as:
(4)  
Converting the constrained minimization problem (4) into an unconstrained one gives the augmented Lagrangian function of (4) as follows:
(5) 
where a positive penalty scalar and the Lagrangian multipliers are introduced. Besides the Lagrangian multipliers, there are four variables to solve for: the two layers and the two auxiliary variables. The solver iteratively updates one variable at a time while fixing the others. Fortunately, each step has a simple closed-form solution and can hence be computed efficiently. The solutions of the subproblems are as follows:
First subproblem: with the other terms fixed, we have:
(6) 
To compute the minimizer, we take the derivative of (6) with respect to the variable and set it to zero, which gives:
(7) 
where a shorthand is introduced for brevity. Directly computing the matrix inverse is the intuitive way to solve this, but when the matrix is relatively large, as in our problem, inversion is very expensive. Fortunately, by assuming circular boundary conditions, we can apply FFT techniques, which enables us to compute the solution efficiently as:
(8) 
(9) 
where the two operators stand for the FFT and the inverse FFT, respectively, and the division in (8) is element-wise.
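The FFT shortcut of (8)–(9) can be illustrated on a 1-D analogue. The sketch below solves a least-squares problem involving a circular convolution operator by element-wise division in the Fourier domain and cross-checks it against a direct matrix solve; the function and problem setup are illustrative, not the paper's exact subproblem:

```python
import numpy as np

def solve_circulant_ls(d, g, t, rho):
    """Minimise ||d (*) b - g||^2 + rho * ||b - t||^2 over b, where (*) is
    circular convolution, via element-wise division in the Fourier domain."""
    D = np.fft.fft(d)
    num = np.conj(D) * np.fft.fft(g) + rho * np.fft.fft(t)
    den = np.abs(D) ** 2 + rho                 # element-wise, as in Eq. (8)
    return np.real(np.fft.ifft(num / den))

n, rho = 8, 0.5
rng = np.random.default_rng(0)
d = np.zeros(n); d[0], d[-1] = -1.0, 1.0       # circular forward-difference filter
g, t = rng.random(n), rng.random(n)
# cross-check against the explicit circulant system (C^T C + rho I) b = C^T g + rho t
C = np.stack([np.roll(d, k) for k in range(n)], axis=1)
b_direct = np.linalg.solve(C.T @ C + rho * np.eye(n), C.T @ g + rho * t)
assert np.allclose(solve_circulant_ls(d, g, t, rho), b_direct)
```

The FFT route costs O(n log n) per solve instead of the O(n^3) of the explicit inverse, which is what makes the update affordable at image scale.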
Second subproblem: discarding the unrelated terms gives:
(10) 
Similarly to the first subproblem, the update can be carried out as:
(11) 
(12) 
with the analogous shorthand.
Third subproblem: let us now update the first auxiliary variable, which corresponds to the following optimization problem:
(13) 
The closed-form solution is obtained by:
(14) 
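The closed-form solution of an ℓ1-regularized proximal step of this kind is soft-thresholding. A minimal sketch (the paper's non-uniform shrinkage additionally allows an element-wise threshold, which NumPy broadcasting below already supports):

```python
import numpy as np

def shrink(x, tau):
    """Soft-thresholding: the closed-form minimiser of
    tau * |z|_1 + 0.5 * ||z - x||_F^2, applied element-wise.
    `tau` may be a scalar or an array of per-element thresholds."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

x = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
assert np.allclose(shrink(x, 0.5), [-1.5, 0.0, 0.0, 0.0, 1.5])
# non-uniform thresholds, one per element
assert np.allclose(shrink(x, np.array([1.0, 1.0, 1.0, 0.1, 0.1])),
                   [-1.0, 0.0, 0.0, 0.2, 1.9])
```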
Fourth subproblem: the update of the second auxiliary variable is analogous to that of the first. The associated optimization problem is:
(15) 
Similarly, the closed-form solution of (15) is:
(16) 
Multipliers: finally, the Lagrangian multipliers need to be updated, which is simply accomplished by:
(17)  
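Putting the pieces together, the overall iteration (FFT-based quadratic step, shrinkage step, multiplier step) can be sketched on a simplified 1-D total-variation analogue of problem (4); this is our own illustrative reduction, not the paper's full two-layer solver:

```python
import numpy as np

def tv_admm_1d(y, lam, rho=1.0, iters=200):
    """ADMM for min_b 0.5*||b - y||^2 + lam*||D b||_1 with a circular
    difference operator D -- a stripped-down analogue of the paper's solver."""
    n = len(y)
    d = np.zeros(n); d[0], d[1] = -1.0, 1.0
    D = np.fft.fft(d)
    b = y.copy(); u = np.zeros(n); m = np.zeros(n)   # u: auxiliary, m: multiplier
    for _ in range(iters):
        # quadratic step, solved in the Fourier domain (cf. Eq. (8))
        rhs = np.fft.fft(y) + rho * np.conj(D) * np.fft.fft(u - m / rho)
        b = np.real(np.fft.ifft(rhs / (1.0 + rho * np.abs(D) ** 2)))
        Db = np.real(np.fft.ifft(D * np.fft.fft(b)))
        # shrinkage step (cf. Eq. (14))
        v = Db + m / rho
        u = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # multiplier step (cf. Eq. (17))
        m += rho * (Db - u)
    return b

y_clean = np.concatenate([np.zeros(16), np.ones(16)])
y = y_clean + 0.1 * np.random.default_rng(1).standard_normal(32)
b = tv_admm_1d(y, lam=0.2)
assert np.mean((b - y_clean) ** 2) < np.mean((y - y_clean) ** 2)  # denoised
```

Each iteration costs only FFTs and element-wise operations, mirroring why the full algorithm is efficient.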
3 Experiments
Parameter Effect. Our model involves three free parameters, and we test the effect of each here. Although quality assessment for the deblocking task is itself questionable [28], we still employ metrics to reflect the trends as the parameters vary. The most widely used full-reference quality metric is probably the peak signal-to-noise ratio (PSNR), which is mathematically simple but does not correlate well with perceived visual quality, so we do not use PSNR to quantitatively measure performance in this paper. Instead, the structural similarity (SSIM) metric measures how similar a pair of images are (the deblocked result and its original) by considering three aspects of similarity, namely luminance, contrast and structure, and is thus more appropriate than PSNR. In addition, we introduce a novel metric called gradient consistency (GC) to complement SSIM, which is defined as follows:
(18) 
where the two arguments are the reference and the recovered image. GC measures the consistency between the gradient fields of the two images. Note that higher SSIM is better, while lower GC is better. Because the interdependence of the three parameters is complex, we test them separately, fixing two parameters while varying the third. As can be seen in Fig. 2, the best value of the first parameter shifts between the two JPEG quality settings in terms of both SSIM and GC; this is consistent with the fact that more artifacts require a more powerful smoother to eliminate them. As for the second parameter, we observe from the second row of Fig. 2 that it performs stably over a range for each JPEG quality setting. Similarly, the third parameter achieves high performance when set to a relatively large value in both cases shown in Fig. 2. For the remaining experiments, we fix the latter two parameters accordingly.
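Since Eq. (18) is not reproduced above, the sketch below shows one plausible instantiation of a gradient-consistency score (mean absolute difference of forward-difference gradient fields); the exact form used in the paper may differ:

```python
import numpy as np

def gradient_consistency(ref, rec):
    """A plausible gradient-consistency score: mean absolute difference of
    the circular forward-difference gradient fields. Lower is better."""
    def grad(x):
        gv = np.roll(x, -1, axis=0) - x
        gh = np.roll(x, -1, axis=1) - x
        return np.stack([gv, gh])
    return np.mean(np.abs(grad(ref) - grad(rec)))

a = np.random.rand(8, 8)
assert gradient_consistency(a, a) == 0.0          # identical images agree fully
assert gradient_consistency(a, a + 3.0) < 1e-9    # invariant to a constant offset
```

Unlike pixel-wise errors, such a score is insensitive to global brightness shifts and directly reflects edge structure, which is the property blocking artifacts corrupt.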
Convergence Speed. Figure 3 displays the convergence speed of the proposed Algorithm 1, without loss of generality, on the image shown in Fig. 1: the stopping criterion drops sharply within the first iterations and continues to decrease thereafter. We also show four pairs of the separated layers at intermediate iterations, from which we see that the results after a moderate number of iterations are already very close to those at convergence.
Relationship to the TV model. From the objective function (3), we can observe that our model reduces to the anisotropic Total Variation (TV) model when the third and fourth terms, i.e. the gradient independence prior, are disabled. To demonstrate the benefit of the gradient independence prior, we compare TV with our method. To better view the difference, we do not introduce artifacts into this test. As shown in Fig. 4, a larger smoothing weight leads to more details being smoothed for both TV and DSLP. The difference is that, in terms of visual quality, TV smooths both high-frequency and low-frequency information, while our DSLP eliminates weak textures but keeps dominant edges. Quantitatively, when the weight is set to 1.0, DSLP achieves much better SSIM and GC than TV, and the results for the other settings are analogous. Note that even with a much larger weight, DSLP can still provide very promising results. From the viewpoint of the artifact layer, we further give the example shown in Fig. 5 to illustrate the power of the independence prior; for better viewing, the artifact layer is amplified. As can be seen, TV filters out textures with a very high false positive ratio (the details of the bird body), while DSLP mainly focuses on the block artifacts. These experimental results reveal the relationship and the differences between TV and DSLP, and demonstrate the advantage of DSLP.
IDSLP: Improved DSLP. Let us revisit the complications of JPEG compression in terms of visual quality. As can be seen in the first image of Fig. 6, there are two main issues, namely the staircase effect around block boundaries and the serration along image edges. Denoising techniques like BM3D [6] can reduce the serration in the frame but hardly deal with the staircase effect, as shown in the second picture of Fig. 6. DSLP, in contrast, is good at cleaning the staircase around block boundaries but leaves the serration (see the third picture in Fig. 6). Intuitively, we can further improve the visual quality by combining their respective advantages. The rightmost result in Fig. 6 demonstrates the effectiveness of such a strategy, obtained by first executing the denoising technique (in this paper we adopt BM3D) and then applying DSLP to the denoised version.
Image Deblocking. In this part, we evaluate the performance of our method on image deblocking against state-of-the-art alternatives, including a reconstruction-based method using Fields of Experts (FoE) [23], a local filtering based method via Shape-Adaptive DCT (SADCT) [8], a layer decomposition based method for JPEG Artifact Suppression (JAS) [14], the denoising-based method BM3D [6], and a Total Variation regularized restoration method (TV) [4], alongside our proposed DSLP and IDSLP. The codes for the competitors were either downloaded from the authors' websites or provided by the authors, and their parameters were tuned or set as suggested by the authors to obtain their best possible results. For DSLP on image deblocking, only the spatial gradients are taken into account. In addition, all the codes are implemented in Matlab, which ensures the fairness of the time-cost comparison. We provide quantitative (SSIM, GC and time) and qualitative results on several JPEG-compressed images in Fig. 7. As can be seen from Fig. 7, FoE, SADCT, JAS and BM3D only slightly suppress, but do not thoroughly eliminate, the staircase effect at such a compression rate. DSLP is able to eliminate or largely reduce the staircase, while IDSLP further mitigates the edge serration. In terms of computational cost, DSLP is superior to SADCT and FoE, competitive with JAS and TV, but inferior to BM3D. Moreover, since IDSLP integrates the denoising and deblocking components, its time cost is the sum of those of BM3D and DSLP. Due to the limited space and the nature of the deblocking problem, please see the supplementary material for larger images and more results, which are best viewed at their original sizes.
Video Deblocking. For this task, we test both spatial-only gradients and spatial-temporal gradients for (I)DSLP, denoted as (I)DSLP and (I)VDSLP, respectively. The comparison involves VBM3D, the video extension of BM3D, together with DSLP, IDSLP and IVDSLP.¹ From Fig. 8, we can see that the problem of BM3D on image deblocking persists for VBM3D on video deblocking; in other words, the staircase remains (see the yellow arrows). DSLP significantly reduces the staircase effect, while IDSLP and IVDSLP further take care of the serration. We note that, compared with IDSLP, IVDSLP slightly removes some textures (e.g. the leaves in the top-right corner, white arrows). This is because the temporal gradient is enforced to be sparse, which is helpful for videos with slow motion but over-smooths the content of videos with sudden or fast motion. More video results can be found in the supplementary material.

¹Another related video deblocking method is [24], but its code was not available when this paper was prepared, so we do not compare with it. Moreover, with regard to time cost, as the authors of [24] state, their C++ implementation takes hours to process a sequence, which significantly limits its applicability.
4 Conclusion
Artifact separation from images or video sequences is an important yet severely ill-posed problem. To overcome this difficulty, this paper has shown how to harness two structural priors on the intrinsic and artifact layers, namely the gradient sparsity of the intrinsic layer and the gradient independence between the two components, to make the problem well-defined and feasible to solve. We have formulated the problem in a unified optimization framework and proposed an efficient algorithm to find the optimal solution. The experimental results, compared with the state of the art, demonstrate the clear advantages of the proposed method in terms of visual quality and simplicity, and its usefulness for many advanced image/video processing tasks.
References
 [1] Information technology — digital compression and coding of continuous-tone still images — requirements and guidelines. Technical Report ISO/IEC 10918-1 and ITU-T T.81, International Telecommunication Union, 1992.
 [2] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.

 [3] H. Burger, C. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2392–2399, 2012.
 [4] S. Chan, R. Khoshabeh, K. Gibson, P. Gill, and T. Nguyen. An augmented Lagrangian method for total variation video restoration. IEEE Transactions on Image Processing, 20(11):3097–3111, 2011.
 [5] I. Choi, S. Kim, M. Brown, and Y. Tai. A learning-based approach to reduce JPEG artifacts in image matting. In Proceedings of International Conference on Computer Vision, pages 2880–2887, 2013.
 [6] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
 [7] C. Fernandez-Granda and E. Candès. Super-resolution via transform-invariant group-sparse regularization. In Proceedings of International Conference on Computer Vision, pages 3336–3343, 2013.
 [8] A. Foi, V. Katkovnik, and K. Egiazarian. Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images. IEEE Transactions on Image Processing, 16(5):1395–1411, 2007.
 [9] T. Goto, Y. Kato, S. Hirano, M. Sakurai, and T. Nguyen. Compression artifact reduction based on total variation regularization method for MPEG-2. IEEE Transactions on Consumer Electronics, 57(1):253–259, 2011.
 [10] S. Gould, J. Zhao, X. He, and Y. Zhang. Superpixel graph label transfer with learned distance metric. In Proceedings of European Conference on Computer Vision, pages 632–647, 2014.
 [11] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with applications to image denoising. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
 [12] K. He, J. Sun, and X. Tang. Single image haze removal using dark channel prior. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1956–1963, 2009.
 [13] C. Jung, L. Jiao, H. Qi, and T. Sun. Image deblocking via sparse representation. Signal Processing: Image Communication, 27(6):663–677, 2012.
 [14] Y. Li, F. Guo, R. Tan, and M. Brown. A contrast enhancement framework with JPEG artifacts suppression. In Proceedings of European Conference on Computer Vision, pages 174–188, 2014.
 [15] J. Lim and H. Reeve. Reduction of blocking effects in image coding. Optical Engineering, 23:34–37, 1984.
 [16] Z. Lin, R. Liu, and Z. Su. Linearized alternating direction method with adaptive penalty for low-rank representation. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 612–620, 2011.
 [17] M. Liu, O. Tuzel, S. Ramalingam, and R. Chellappa. Entropy rate superpixel segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2097–2104, 2011.
 [18] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
 [19] G. Meng, Y. Wang, J. Duan, S. Xiang, and C. Pan. Efficient image dehazing with boundary constraint and contextual regularization. In Proceedings of International Conference on Computer Vision, pages 617–624, 2013.
 [20] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi. Video coding with H.264/AVC: Tools, performance, and complexity. IEEE Circuits and Systems Magazine, 4(1):7–28, 2004.
 [21] B. Ramamurthi and A. Gersho. Nonlinear space-variant postprocessing of block coded images. IEEE Transactions on Acoustics, Speech and Signal Processing, 34(5):1258–1268, 1986.
 [22] K. Rijkse. H.263: Video coding for low-bit-rate communication. IEEE Communications Magazine, 34(12):42–45, 1996.
 [23] D. Sun and W. Cham. Postprocessing of low bit-rate block DCT coded images based on a Fields of Experts prior. IEEE Transactions on Image Processing, 16(11):2743–2751, 2007.
 [24] D. Sun and C. Liu. Non-causal temporal prior for video deblocking. In Proceedings of European Conference on Computer Vision, pages 510–523, 2012.
 [25] K. Tang, J. Yang, and J. Wang. Investigating haze-relevant features in a learning framework for image dehazing. In Proceedings of European Conference on Computer Vision, pages 2995–3002, 2014.
 [26] Z. Wang, B. Fan, and F. Wu. Affine subspace representation for feature description. In Proceedings of European Conference on Computer Vision, pages 94–108, 2014.
 [27] J. Yan, S. Lin, S. Kang, and X. Tang. A learningtorank approach for image color enhancement. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2987–2994, 2014.
 [28] C. Yim and A. Bovik. Quality assessment of deblocked images. IEEE Transactions on Image Processing, 20(1):88–98, 2011.
 [29] Y. Zhu, Y. Zhang, and A. Yuille. Single image super-resolution using deformable patches. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2917–2924, 2014.