One-to-Many Network for Visually Pleasing Compression Artifacts Reduction

11/15/2016 · by Jun Guo, et al. · Sun Yat-sen University

We consider the compression artifacts reduction problem, where a compressed image is transformed into an artifact-free image. Recent approaches for this problem typically train a one-to-one mapping using a per-pixel L_2 loss between the outputs and the ground-truths. We point out that these approaches tend to produce overly smooth results, and that PSNR does not reflect their real perceptual performance. In this paper, we propose a one-to-many network, which measures output quality using a perceptual loss, a naturalness loss, and a JPEG loss. We also avoid grid-like artifacts during deconvolution using a "shift-and-average" strategy. Extensive experimental results demonstrate the dramatic visual improvement of our approach over the state of the arts.


1 Introduction

Compression artifacts reduction is a classical problem in computer vision. It aims to estimate an artifact-free image from a lossily compressed one. In this age of information explosion, the number of images spreading on the Internet grows rapidly, and lossy compression (e.g., JPEG [42], WebP [13], and HEVC-MSP [41]) is inevitably adopted to save bandwidth and storage space. However, lossy compression by its nature causes information loss and undesired artifacts, which severely degrade the user experience. Thus, how to recover visually pleasing artifact-free images has attracted more and more attention.

Given that JPEG is the most widely used lossy compression scheme in the world, in the following we focus on JPEG compression artifacts reduction. Various approaches have been proposed to suppress JPEG compression artifacts. Early works [38] manually designed filters to remove simple artifacts. More recently, learning-based approaches have taken the dominant position. Sparse-coding methods [22, 5, 2, 29, 28, 37] reconstruct artifact-free images from learned dictionaries; they can produce sharpened images but are usually accompanied by noisy edges and unnatural regions. To date, deep learning has proven to possess great capability for vision tasks. In particular, ARCNN [8] and DDCN [15] have demonstrated the power of deep convolutional neural networks (CNNs) in eliminating JPEG compression artifacts, and D3 [45] cast sparse coding into a deep fully-connected network and also obtained impressive results. Nevertheless, state-of-the-art deep learning approaches still do not produce satisfactory outcomes: the recovered images tend to be overly smooth, containing significantly fewer textures than the (uncompressed) ground-truth images. See Fig. 3(a) for examples (ground-truths can be found in Fig. 17).

(a) Results of ARCNN
(b) Our results
Figure 3: Compression artifacts reduction examples. Compared with ARCNN, our results have much richer textures; e.g., see the fur in Row 1 and the bushes in Row 2.

Taking an image as input, the JPEG encoder first divides it into non-overlapping coding blocks. The discrete cosine transform (DCT) is then applied to each block, and the resulting DCT coefficients are uniformly quantized according to the JPEG quantization table. For decoding, the JPEG decoder performs the inverse DCT on the quantized coefficients to obtain pixel values. It can be seen that information loss and compression artifacts are entirely due to quantization. Most learning-based approaches, including the aforementioned state of the arts (e.g., ARCNN, D3, and DDCN), simply learn a one-to-one mapping between JPEG-compressed images and the corresponding ground-truths. Such designs have a drawback: owing to the many-to-one nature of quantization, there are in fact multiple plausible ground-truths for a compressed image, and because image preference is subjective, different people may favor different ones. It is therefore better to develop a one-to-many mapping for recovering artifact-free images, which provides users with a series of high-quality candidates and lets them pick what they like.
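For concreteness, the following sketch (not part of the original paper) reproduces this lossy round trip for a single 8x8 luminance block, using the standard JPEG luminance quantization table; real encoders additionally scale the table according to the chosen quality factor.

```python
# Illustrative sketch: the lossy round trip a JPEG encoder/decoder performs on one
# 8x8 luminance block. Q50 is the standard JPEG luminance quantization table.
import numpy as np
from scipy.fft import dctn, idctn

Q50 = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
], dtype=np.float64)

def jpeg_round_trip(block, Q=Q50):
    """Encode and decode one 8x8 block; everything lost here is due to rounding."""
    coeffs = dctn(block - 128.0, norm="ortho")          # forward DCT (level-shifted)
    quantized = np.round(coeffs / Q)                    # many-to-one: the only lossy step
    decoded = idctn(quantized * Q, norm="ortho") + 128  # decoder multiplies Q back
    return np.clip(decoded, 0, 255)

block = np.random.randint(0, 256, size=(8, 8)).astype(np.float64)
print(np.abs(jpeg_round_trip(block) - block).max())     # nonzero: information was lost
```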

Nevertheless, measuring output quality is a difficult task. Most existing approaches like ARCNN adopt a per-pixel L_2 loss, since it is straightforward and encourages solutions whose pixel values are close to the ground-truths. Unfortunately, this loss is a convex function of the output, so there is only one optimal solution for a given input. This contradicts the many-to-one property of quantization and leads to incorrect results. Consider a toy example in which several different gray levels are all quantized to the same value, and we want to recover the original gray level from that value. If we learn a mapping using the L_2 loss, then no matter how we model the mapping, we will end up with one unique prediction for this input. As the mapping is trained to minimize the error averaged over a dataset, this prediction will tend to be the average of the ground-truths, which is obviously not any one of them. Looking back at compression artifacts reduction, it is now clear where the overly smooth outputs of existing approaches come from: since each compressed image is deterministically mapped to the ground-truth average because of the per-pixel loss, details cancel each other out during averaging, resulting in blurred regions everywhere.
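The following toy numpy experiment (ours, not the paper's) illustrates the point: when many distinct gray levels share one quantized code, the deterministic prediction that minimizes the average L_2 error is their mean.

```python
# Toy illustration: with several valid ground-truths per input, the L2-optimal
# deterministic prediction is their average, not any individual ground-truth.
import numpy as np

rng = np.random.default_rng(0)
ground_truths = rng.uniform(96, 160, size=10_000)   # distinct originals, all quantized to one code

# A deterministic mapping can only output a single value v for this input.
candidates = np.linspace(96, 160, 1_000)
l2_risk = [np.mean((v - ground_truths) ** 2) for v in candidates]
best = candidates[int(np.argmin(l2_risk))]

print(best, ground_truths.mean())   # the minimizer sits at the ground-truth mean
```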

What's worse, the per-pixel loss does not describe perceptual differences between images well. For instance, if we shift one of two identical images by one pixel, the two images remain perceptually similar, although they can differ considerably under the per-pixel loss. Recent works discovered that perceptual similarity is better described by differences between high-level image features extracted from pretrained CNNs. This technique has been applied in feature visualization [39], feature inversion [32], style transfer [11, 10], etc., and succeeds in recovering semantic structures compared with the per-pixel loss. Nevertheless, high-level features are generally invariant to low-level details, so results from this technique usually exhibit visible distortions and insufficient textures.

On the other side, Generative Adversarial Networks (GANs) [12] have been shown to be promising in producing fine details [12, 7, 35]. This technique is usually employed to generate images that seem natural, with naturalness measured by a binary classifier: intuitively, if generated images are hard to distinguish from natural images, then they should be "natural" enough for humans. However, although significant improvements have been introduced [7, 35], GANs still have difficulty in generating visually pleasing semantic structures.

In this work, we combine the benefits of these two techniques within a one-to-many mapping. More specifically, we propose a one-to-many network, which is decomposed into two components, a proposal component and a measurement component, as shown in Fig. 4. The proposal component takes a JPEG-compressed image as input and outputs a series of artifact-free candidates. The measurement component estimates the output quality. We adopt a perceptual loss that depends on high-level features from a pretrained VGGNet [40] to estimate the perceptual quality of candidates. In addition, we train a discriminative network to measure the naturalness of the candidates, which becomes our second loss. Meanwhile, we notice that these two losses are still not sufficient for good results: neither of them respects image characteristics such as the color distribution. As a result, outputs from the proposal component often contain unwanted noise and have different contrast when compared with the inputs or ground-truths. To resolve this issue, we further introduce a JPEG loss that uses the JPEG quantization table as a prior to regularize the DCT coefficient range of the outputs. Besides, we find that, when combined with highly non-convex loss functions, deconvolution usually leads to grid-like artifacts; we propose a "shift-and-average" strategy to handle this problem. Experiments show that our approach is able to generate results favored by humans; compare Fig. 3(a) and Fig. 3(b) for examples.

Figure 4: An overview of the one-to-many network. Given a JPEG-compressed image, the proposal component generates artifact-free candidates, whose qualities are further evaluated by the measurement component.

2 Related Works

2.1 Compression Artifacts Reduction

Many approaches have been proposed to cope with compression artifacts. Early works utilized carefully chosen filters to suppress blocking and ringing artifacts. For example, Reeve and Lim [36] applied a Gaussian filter to the pixels around coding block boundaries to smooth out blocking artifacts, and Chen et al. [3] employed a low-pass filter on the DCT coefficients of adjacent coding blocks for deblocking. However, it is unlikely that such manual designs can sufficiently model compression degradation.

Recently, learning-based approaches have gradually become the first choice. One representative family is sparse coding [22, 5, 2, 29, 28, 37]. In general, these approaches first encode an input image over a compressed-image dictionary, and then pass the sparse coefficients into an uncompressed-image dictionary for reconstruction. Sparse-coding-based approaches are usually inefficient, as they require a complicated optimization procedure, and they are difficult to train end-to-end. Thus, their performance is limited.

Neural networks date back decades [26]. Nowadays, deep neural networks, especially deep CNNs, have shown explosive successes in both high-level [25, 40, 34] and low-level [9, 8, 45, 15] vision problems. ARCNN [8] demonstrated the great potential of end-to-end trained CNNs in removing various compression artifacts. DDCN [15] pointed out that the relatively shallow ARCNN was insufficient to eliminate complex artifacts and thus proposed a much deeper (20-layer) architecture. D3 [45] converted sparse-coding approaches into a LISTA-based [14] deep neural network and obtained gains in both speed and performance. Both DDCN and D3 adopt JPEG-related priors to improve reconstruction quality.

2.2 Perceptual Loss

A number of recent works use a perceptual loss, defined on high-level features extracted from pretrained CNNs, as the optimization target. Mahendran and Vedaldi [32] inverted features from convolutional networks by minimizing a feature reconstruction loss, in order to analyze the visual information captured by different network layers. Similar optimization targets have been used in feature visualization [39], artistic style transfer [11, 10], and so on. In particular, Johnson et al. [21] trained a feed-forward network to solve the optimization problem, largely reducing the computational cost. Their work is particularly relevant to ours, as they showed impressive results in image super-resolution by replacing the per-pixel loss with the perceptual loss when training a CNN. However, as discussed in the Introduction, minimizing only the perceptual loss usually leads to unsatisfactory details.

2.3 Generative Adversarial Networks

Starting from Goodfellow et al. [12], who introduced GANs for generating digits, GANs have attracted significant attention in the area of image generation. In general, a GAN contains a generative network and a discriminative network. The discriminative network is trained to determine whether an image is real or produced by the generative network, while the generative network is trained to improve its outputs so that they cannot be easily distinguished from real images. Training GANs is tricky and unstable. Denton et al. [7] built a Laplacian pyramid of GANs (LAPGAN) to generate natural images in a coarse-to-fine scheme, and DCGAN [35] proposed good practices for training GANs. More applications of GANs can be found in [33, 19, 43]. Concurrent with our work, Ledig et al. [27] also combine a VGGNet-based perceptual loss and GANs for image restoration, and have achieved impressive results.

3 One-to-Many Network

3.1 Formulation

Consider a JPEG-compressed image $Y$. Our goal is to recover from $Y$ a series of artifact-free images that are as similar as possible to the uncompressed ground-truth $X$. Note that here we only consider one ground-truth for each input. Although a compressed image may originate from numerous uncompressed images, in practice, due to the available data, we can usually access only one ground-truth per compressed image. Nevertheless, our discussion can easily be extended to multiple ground-truths.

3.2 Proposal Component

Our one-to-many network contains two major components. In this sub-section we describe the proposal component, which models the one-to-many mapping as a deep CNN. To enable the one-to-many property, we introduce an auxiliary variable $Z$ as a hidden additional input. The network takes a compressed image $Y$ as input; at the same time it samples a $Z$ from a zero-centered normal distribution. Both $Y$ and $Z$ are then fed into the network for non-linear mapping. As JPEG compression is not optimal, redundant information neglected by the JPEG encoder can still be found in a compressed image, and a deep CNN on $Y$ can effectively discover and exploit such information to recover details eliminated by quantization. The sampled $Z$ adds randomness to the network, encouraging it to explore and generate distinct artifact-free candidates.

3.2.1 Network Structure

The proposal component roughly follows the network structures set forth by [43] and [21]. At first the compressed image $Y$ and the sampled $Z$ are fed into two separate branches. The outputs of these two branches are then concatenated, and an aggregation sub-network is applied on top of the concatenated feature maps to generate an artifact-free prediction. Fig. 5 illustrates this component.

Figure 5: The architecture of the proposal component. The filter number of the last layer equals the channel number of the inputs; each of the other convolutional layers contains a fixed number of filters.

Very recently, skip connections, especially identity shortcuts, have become very popular in building deep neural networks. He et al.'s deep residual network (ResNet) [16], which consists of many stacked residual units, has shown state-of-the-art accuracy on several challenging recognition tasks. Our work follows their wisdom. In the proposal component, each branch contains several residual units, and the aggregation sub-network comprises additional residual units. For the residual unit we adopt the pre-activation variant proposed in He et al.'s later work [18]: each residual unit includes two Batch Normalization [20] layers, two ReLU [25] layers, and two convolutional layers.

Before a compressed image is forwarded to the network, it is down-sampled by a stride-2 convolutional layer, and the network output is finally up-sampled by a stride-2 deconvolutional layer to restore the image size. There are two benefits to networks that down-sample and then up-sample. First, due to the reduced input resolution, the computational cost is much lower (roughly a quarter of that of the version without down-sampling). Second, with the same filter size and the same number of layers, 2x down-sampling doubles the effective receptive field size, which is advantageous for integrating spatial information over large areas.
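To make the described layout concrete, here is a minimal PyTorch-style sketch of such a proposal component. The unit counts, filter numbers, filter sizes, and noise standard deviation are placeholders rather than the paper's settings, and the shift-and-average up-sampling of Section 3.2.2 is omitted here.

```python
# Minimal sketch of the proposal component described above (not the authors' code).
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Pre-activation residual unit: (BN -> ReLU -> conv) x 2 with identity shortcut."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class ProposalComponent(nn.Module):
    def __init__(self, in_ch=1, ch=64, branch_units=2, agg_units=4, noise_std=1.0):
        super().__init__()
        self.noise_std = noise_std
        self.down_img = nn.Conv2d(in_ch, ch, 4, stride=2, padding=1)    # down-sample image
        self.down_noise = nn.Conv2d(1, ch, 4, stride=2, padding=1)      # down-sample noise
        self.branch_img = nn.Sequential(*[ResidualUnit(ch) for _ in range(branch_units)])
        self.branch_noise = nn.Sequential(*[ResidualUnit(ch) for _ in range(branch_units)])
        self.aggregate = nn.Sequential(*[ResidualUnit(2 * ch) for _ in range(agg_units)])
        self.up = nn.ConvTranspose2d(2 * ch, in_ch, 4, stride=2, padding=1)  # up-sample

    def forward(self, y):
        z = torch.randn_like(y[:, :1]) * self.noise_std   # auxiliary variable Z
        a = self.branch_img(self.down_img(y))
        b = self.branch_noise(self.down_noise(z))
        return self.up(self.aggregate(torch.cat([a, b], dim=1)))

x_hat = ProposalComponent()(torch.rand(1, 1, 64, 64))     # one candidate per forward pass
```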

3.2.2 Up-Sampling

There is no free lunch: although down-sampling has several benefits, up-sampling is not as trivial as it first looks. Consider a 1-D example in which a stride-2 deconvolutional layer is used for up-sampling, and suppose we apply the deconvolution to a constant input whose entries all equal a scalar c. The expected output should be constant as well. However, with stride-2 deconvolution the even and odd output positions are produced by two different subsets of the filter weights, so the actual output alternates between two values. If we require the actual output to match the expected one, the trained filter must satisfy the constraint that these two subsets of weights sum to the same value. This constraint may be learned implicitly if we use a per-pixel loss on top of the deconvolutional layer, because that loss is simple enough that learning the filter weights is nearly a convex optimization problem. But if a highly non-convex loss is adopted, we find that a network struggles to learn this requirement, resulting in apparent grid-like artifacts. Indeed, this kind of artifact can be seen in the results of many previous works that combined deconvolution with a complicated loss function (e.g., see Fig. 8 in [21]).

Note that using a different filter size will not resolve this issue (in fact, it may be worse if the filter size is odd). Instead, in this work we propose a simple strategy named "shift-and-average". Continuing the previous example, after we obtain the deconvolution output, the following two steps are performed:

  1. Duplicate the output and shift the copy right by one pixel.

  2. Average the output and its shifted copy.

Figure 6: An illustration for the shift-and-average strategy.

Fig. 6 provides an illustration. We can see that a constant input now results in a constant output, as expected. This strategy can easily be extended to 2-D data: for 2-D stride-2 deconvolution, three shifts (shift right, shift down, shift right-and-down) are performed in Step 1. In general, for stride-s n-D deconvolution, s^n - 1 shifts are required. Both steps of the proposed strategy can be efficiently parallelized, and thus run extremely fast on GPUs.
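A small numpy demonstration of both the problem and the fix follows; the filter values are chosen arbitrarily to make the two phases differ, and are not trained weights.

```python
# 1-D illustration of the grid-like artifact and the shift-and-average fix.
import numpy as np

def deconv1d_stride2(x, w):
    """Transposed convolution: each input sample scatters w into the output at stride 2."""
    out = np.zeros(2 * len(x) + len(w) - 2)
    for i, v in enumerate(x):
        out[2 * i: 2 * i + len(w)] += v * w
    return out

x = np.ones(6)                      # constant input c = 1
w = np.array([0.9, 0.3])            # the two phases sum to different values -> artifacts
raw = deconv1d_stride2(x, w)
print(raw)                          # alternates 0.9, 0.3, 0.9, 0.3, ...

# Shift-and-average: duplicate the output, shift the copy right by one pixel, average.
shifted = np.roll(raw, 1)
smoothed = 0.5 * (raw + shifted)
print(smoothed[2:-2])               # constant 0.6 away from the borders
```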

3.3 Measurement Component

After we obtain an output $\hat{X}$ from the proposal component, the measurement component is adopted to estimate whether $\hat{X}$ would be favored by humans. We define three loss functions for this measurement.

3.3.1 Perceptual Loss

The perceptual loss estimates the semantic similarity between the reconstruction $\hat{X}$ and the ground-truth $X$. Previous works [10, 11, 21] found that features from deep networks pretrained for image classification describe perceptual information well. In particular, Mahendran and Vedaldi [32] showed that features extracted from lower layers tend to retain photographically accurate information, whereas higher-layer features are more invariant to color, texture, and shape differences. Thus, rather than advocating per-pixel matching, the perceptual loss is defined to encourage $\hat{X}$ and $X$ to share similar high-layer features:

$$\mathcal{L}_{per} = \frac{1}{N_{\phi}} \left\| \phi(\hat{X}) - \phi(X) \right\|_{2}^{2} \qquad (1)$$

where $\phi(\cdot)$ denotes the features computed from a pretrained network and $N_{\phi}$ is the feature size. In this work, we employ the activations of the last convolutional layer of the 16-layer VGGNet [40], i.e., layer "relu5_4", as $\phi$.
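For illustration, a minimal PyTorch-style sketch of such a feature-matching loss is given below; it is not the authors' implementation. Since a layer named "relu5_4" exists in the 19-layer VGG variant, the sketch truncates torchvision's VGG-19 at that point (the cut index and the use of VGG-19 are assumptions), and pretrained classification weights would need to be loaded for $\phi$ in practice.

```python
# Hedged sketch of a VGG-based perceptual loss in the spirit of Eq. (1).
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    def __init__(self, cut=36):                       # assumed index just past relu5_4
        super().__init__()
        self.phi = vgg19().features[:cut].eval()      # load pretrained weights in practice
        for p in self.phi.parameters():
            p.requires_grad_(False)                   # the measurement network stays fixed

    def forward(self, x_hat, x):
        f_hat, f = self.phi(x_hat), self.phi(x)
        return ((f_hat - f) ** 2).mean()              # squared error, normalized by feature size

loss = PerceptualLoss()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```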

3.3.2 Naturalness Loss

The perceptual loss has a high degree of geometric and photometric invariance. This is good for reconstructing semantic structures, but it also has drawbacks. Consider a natural image and a moderately smoothed version of it: minimizing the perceptual loss will not strongly favor the un-smoothed version, as those smoothed-out details normally have little influence on semantic discrimination. However, we wish to recover artifact-free images that are as "natural" as possible.

We introduce another loss to resolve this issue, following the spirit of GANs. We build an additional network $D$ to distinguish whether an image is generated by the proposal component or is a natural image. Network $D$ performs binary classification and outputs the probability of an input being "natural". We take the negative log of this probability on $\hat{X}$ as the second loss of the measurement component, encouraging $\hat{X}$ to have a high probability:

$$\mathcal{L}_{nat} = -\log D(\hat{X}) \qquad (2)$$

Network $D$ needs to be trained as well. We adopt the binary cross-entropy loss as its optimization target:

$$\mathcal{L}_{D} = -\log D(X) - \log\bigl(1 - D(\hat{X})\bigr) \qquad (3)$$

As can be seen from Eq. (2) and Eq. (3), the proposal component and network $D$ compete against each other: the proposal component tries to generate an artifact-free image that is hard for network $D$ to differentiate from natural images, while $D$ is trained to avoid being fooled.
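To make the adversarial objectives concrete, the snippet below sketches one plausible implementation of Eqs. (2) and (3) as reconstructed above; the epsilon terms and the averaging over a batch are implementation details assumed here, not taken from the paper.

```python
# Sketch of the two adversarial objectives; D outputs the probability of being natural.
import torch

def naturalness_loss(d_fake_prob):
    """Eq. (2): the proposal component wants D's probability on its outputs to be high."""
    return -torch.log(d_fake_prob + 1e-8).mean()

def discriminator_loss(d_real_prob, d_fake_prob):
    """Eq. (3): binary cross-entropy on natural vs. generated images."""
    return -(torch.log(d_real_prob + 1e-8) +
             torch.log(1.0 - d_fake_prob + 1e-8)).mean()
```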

              DCGAN                 Network D (ours)
Conv Unit 1   stride-2 conv (64)    stride-1 conv (64), stride-2 conv (64)
Conv Unit 2   stride-2 conv (128)   stride-1 conv (128), stride-2 conv (128)
Conv Unit 3   stride-2 conv (256)   stride-1 conv (256), stride-2 conv (256)
Conv Unit 4   stride-2 conv (512)   stride-1 conv (512), stride-2 conv (512)
Classifier    Logistic Regression   Logistic Regression
Table 1: Comparison of DCGAN and our network $D$. "conv" is short for convolution; each convolutional layer except the last is followed by a Batch Normalization and a Leaky ReLU, sequentially. All convolutional layers share the same filter size. The filter numbers are shown in parentheses after "conv".

For the structure of network $D$, we generally follow the architectural guidelines of DCGAN [35], with the network depth doubled. More specifically, we also adopt four convolutional units, but each unit is composed of two convolutional layers instead of one. Every convolutional layer except the last is followed by a Batch Normalization and a Leaky ReLU [31]. The outputs of the last convolutional unit are fed into a logistic regression classifier. We detail the differences between our network $D$ and DCGAN in Table 1. (In the first version of this paper, Table 1 contained a 512-node fully-connected layer between "Logistic Regression" and the last convolutional layer. Our intention was to indicate that the classifier took 512-channel feature maps as its input, and at that time "Logistic Regression" in the table just meant a sigmoid transform, not the classifier. Unfortunately, this was quite misleading, so we have revised this part in the present version.)
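Below is a minimal sketch of a discriminator following Table 1. The filter size, the Leaky ReLU slope, and the global average pooling before the logistic-regression classifier are assumptions made for the sketch, not settings reported in the paper.

```python
# Sketch of network D: four conv units (stride-1 conv then stride-2 conv each),
# BatchNorm + LeakyReLU after every conv except the last, logistic regression on top.
import torch
import torch.nn as nn

def conv_unit(cin, cout, last=False):
    layers = [nn.Conv2d(cin, cout, 3, stride=1, padding=1),
              nn.BatchNorm2d(cout), nn.LeakyReLU(0.2, inplace=True),
              nn.Conv2d(cout, cout, 3, stride=2, padding=1)]
    if not last:
        layers += [nn.BatchNorm2d(cout), nn.LeakyReLU(0.2, inplace=True)]
    return layers

class DiscriminatorD(nn.Module):
    def __init__(self, in_ch=1):
        super().__init__()
        self.features = nn.Sequential(
            *conv_unit(in_ch, 64), *conv_unit(64, 128),
            *conv_unit(128, 256), *conv_unit(256, 512, last=True))
        self.classifier = nn.Linear(512, 1)                    # logistic regression

    def forward(self, x):
        f = self.features(x).mean(dim=(2, 3))                  # pool 512-channel maps (assumption)
        return torch.sigmoid(self.classifier(f)).squeeze(1)    # probability of being natural

p = DiscriminatorD()(torch.rand(2, 1, 64, 64))                 # shape: (2,)
```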

3.3.3 JPEG Loss

Intuitively, if we adjust the contrast of an image, little semantic information changes. That is, the perceptual loss is insensitive to the color distribution of a reconstructed image $\hat{X}$. Likewise, Eq. (2) shows that the naturalness loss does not care whether the color distributions of $\hat{X}$ and the input match. In contrast, for the purpose of compression artifacts reduction, we hope the color distribution of an input is roughly retained. Hence, we introduce an additional JPEG-related loss to enforce this constraint.

Estimating the real color distribution is rather difficult. Fortunately, the JPEG standard is built on various pre-defined parameters, and by leveraging these parameters we can at least obtain lower and upper bounds on the DCT coefficients. As mentioned above, for compression, the JPEG encoder divides the DCT coefficients of an input image by a quantization table and then rounds the results to the nearest integers; the JPEG decoder performs decompression by multiplying the quantization table back. Thus, the relation between a compressed image $Y$ and the corresponding uncompressed image $X$ can be formulated as:

$$C^{Y}_{ij} = \mathrm{round}\!\left(\frac{C^{X}_{ij}}{Q_{ij}}\right) \cdot Q_{ij} \qquad (4)$$

where $C^{Y}$ and $C^{X}$ are the DCT coefficients of $Y$ and $X$, respectively, $Q$ is the quantization table, and $i$ and $j$ are indices in the DCT domain. Eq. (4) implies the following DCT coefficient range constraint:

$$C^{Y}_{ij} - \frac{Q_{ij}}{2} \;\le\; C^{X}_{ij} \;\le\; C^{Y}_{ij} + \frac{Q_{ij}}{2} \qquad (5)$$

Each recovered artifact-free image $\hat{X}$ should satisfy Eq. (5) as well. We therefore propose the following JPEG loss:

$$\mathcal{L}_{jpeg} = \frac{1}{M} \sum_{i,j} \max\!\left(\left|C^{\hat{X}}_{ij} - C^{Y}_{ij}\right| - \frac{Q_{ij}}{2},\; 0\right)^{2} \qquad (6)$$

where $C^{\hat{X}}$ contains the DCT coefficients of $\hat{X}$, and $M$ is the size of $\hat{X}$. As can be seen, the JPEG loss is a truncated loss: a reconstructed image whose DCT coefficients fall outside the lower / upper bounds of Eq. (5) is penalized, while coefficients within the bounds incur no cost.
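As a rough illustration of this loss, the numpy sketch below penalizes blockwise DCT coefficients of a reconstruction that leave the interval allowed by Eq. (5). The squared-hinge penalty, the orthonormal DCT scaling, and the per-coefficient normalization are assumptions; the paper's exact formulation may differ.

```python
# Hedged sketch of a DCT-domain range penalty in the spirit of Eq. (6).
import numpy as np
from scipy.fft import dctn

def blockwise_dct(img, block=8):
    h, w = img.shape
    out = np.zeros_like(img, dtype=np.float64)
    for i in range(0, h, block):
        for j in range(0, w, block):
            out[i:i + block, j:j + block] = dctn(
                img[i:i + block, j:j + block] - 128.0, norm="ortho")
    return out

def jpeg_loss(x_hat, y, Q):
    """Penalize DCT coefficients of x_hat outside [C^Y - Q/2, C^Y + Q/2]."""
    c_hat, c_y = blockwise_dct(x_hat), blockwise_dct(y)
    q = np.tile(Q, (x_hat.shape[0] // 8, x_hat.shape[1] // 8))
    violation = np.maximum(np.abs(c_hat - c_y) - q / 2.0, 0.0)
    return np.mean(violation ** 2)
```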

3.4 Joint Training for the One-to-Many Network

We merge all the aforementioned loss functions to build the measurement component:

$$\mathcal{L} = \mathcal{L}_{per} + \lambda_{nat}\,\mathcal{L}_{nat} + \lambda_{jpeg}\,\mathcal{L}_{jpeg} \qquad (7)$$

In this paper, $\lambda_{nat}$ is always set to a fixed constant, while $\lambda_{jpeg}$ needs special treatment. Note that the JPEG encoder performs quantization on each non-overlapping coding block individually, so for a patch misaligned with the coding block boundaries we cannot obtain its DCT coefficients. Hence, we set $\lambda_{jpeg}$ according to the given patch. Generally speaking, the training procedure of our one-to-many network is similar to that of the original GAN [12], which includes two main steps in each iteration:

  1. Fix the proposal component and optimize the discriminative network $D$ with Eq. (3).

  2. Fix network $D$ and optimize the proposal component with the measurement component (i.e., Eq. (7)). If an input patch is aligned with the JPEG coding block boundaries, $\lambda_{jpeg}$ is set to a positive constant; otherwise it is set to 0.

In the first epoch of training we only perform the second step without network $D$, i.e., network $D$ is neither trained nor used in the first epoch. The reason is that, at the beginning, the generated images are of low quality, so even a trivial network $D$ can distinguish them from natural images; feeding them to network $D$ would be a waste of computational resources.
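The following PyTorch-flavored pseudocode sketches one way to organize this alternating scheme; loader construction, optimizer settings, and the weighting functions are placeholders rather than the authors' configuration.

```python
# Sketch of the alternating training described above.
def train(G, D, loader, opt_G, opt_D, perceptual, naturalness, d_loss, jpeg_loss,
          epochs, w_nat, w_jpeg_fn):
    for epoch in range(epochs):
        for y, x in loader:                          # compressed patch, ground-truth patch
            x_hat = G(y)                             # G samples Z internally

            if epoch > 0:                            # D is neither trained nor used in epoch 0
                opt_D.zero_grad()                    # Step 1: fix G, update D with Eq. (3)
                d_loss(D(x), D(x_hat.detach())).backward()
                opt_D.step()

            opt_G.zero_grad()                        # Step 2: fix D, update G with Eq. (7)
            loss = perceptual(x_hat, x) + w_jpeg_fn(y) * jpeg_loss(x_hat, y)
            if epoch > 0:
                loss = loss + w_nat * naturalness(D(x_hat))
            loss.backward()
            opt_G.step()
```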

4 Experiments

In this section, we conduct experiments to demonstrate the effectiveness of the proposed one-to-many network.

Dataset.

In all experiments, we employ the ImageNet dataset [6] for training. The validation set of the BSDS500 dataset [1] is used for validation. Following the standard protocol of previous approaches, the MATLAB JPEG encoder is applied to generate JPEG-compressed images. Other JPEG encoders are generally acceptable as well, as we did not notice visible differences in the resulting images.

Parameter Settings.

We roughly follow the parameter settings in DCGAN [35] and train our one-to-many network with Adam [23]. (In the first version of this paper, the reported learning rate and batch size were somewhat ambiguous: the network was actually trained on two GPUs, with each GPU handling its own batch, and gradients were rescaled in proportion to the batch size for ease of hyper-parameter tuning [24]. Taking these into consideration, the effective learning rate and batch size differ from the reported values when the network is trained on a single GPU without gradient rescaling.) For the Leaky ReLU, the slope of the leak is set to a small positive constant. Training images are prepared as same-size patches for the network inputs. All weights are initialized using He et al.'s uniform initializer [17]. During testing, the proposal component runs as a fully convolutional network [30] to generate full-image predictions.

4.1 Baseline Evaluations

The traditional metrics used to evaluate compression artifacts reduction are PSNR, SSIM [44], and PSNR-B [46]. All of them rely on low-level differences between pixels. In particular, PSNR is equivalent to the per-pixel L_2 loss, so when measured by PSNR, a model trained to minimize the per-pixel error will always have an advantage over a model that minimizes Eq. (7). For a fair comparison, in this experiment we replace the measurement component with the per-pixel L_2 loss. We also drop the auxiliary variable $Z$ in the proposal component, as there is only one optimal solution under this loss. We name this variant the "baseline" in the following.

We compare our baseline with two recent compression artifacts reduction approaches, ARCNN and DDCN, on the test set of the BSDS500 dataset [1]. We also include the recent generic image restoration framework TNRD [4] for comparison. D3 is not examined here because no public code or model is available for evaluation. Three JPEG qualities are evaluated: 5, 10, and 20. All experiments in this section are conducted on the luminance channel (in the YCbCr color space), following the protocol of previous approaches.

Table 2 shows the quantitative results. On the whole, our baseline largely outperforms ARCNN and TNRD on all JPEG qualities and evaluation metrics, and is on par with DDCN. In particular, our baseline performs best for Quality 5, indicating that it is especially suitable for low-quality inputs, which place higher demands on reconstruction. We emphasize that our goal in this paper is not to achieve the best PSNR / SSIM / PSNR-B results, but to improve the human favorability of the recovered artifact-free images. Thus, we did not add techniques such as the DCT-domain priors used in D3 and DDCN to further improve PSNR.

Quality   Approach   PSNR (dB)   SSIM     PSNR-B (dB)
5         JPEG       25.36       0.6764   22.91
          ARCNN      26.72       0.7256   26.48
          TNRD       26.81       0.7279   26.65
          DDCN       26.98       0.7333   26.76
          Baseline   27.12       0.7406   26.87
10        JPEG       27.80       0.7875   25.10
          ARCNN      29.10       0.8198   28.73
          TNRD       29.16       0.8225   28.81
          DDCN       29.59       0.8381   29.18
          Baseline   29.56       0.8352   29.10
20        JPEG       30.05       0.8671   27.22
          ARCNN      31.28       0.8854   30.55
          TNRD       31.41       0.8889   30.83
          DDCN       31.88       0.8996   31.10
          Baseline   31.89       0.8977   31.04
Table 2: Comparisons with the state of the arts on the BSDS500 dataset. Red color indicates the best performance; blue color indicates the second-best performance.
(a) Ground-truth  (b) JPEG  (c) DDCN  (d) Baseline  (e) One-to-Many
(f) Ground-truth  (g) JPEG  (h) DDCN  (i) Baseline  (j) One-to-Many
Figure 17: Comparisons on BSDS500. Row 1: Image 6046; Row 2: Image 344010. Best viewed on screen.

In Columns 3 and 4 of Fig. 17, we present restored images from DDCN and our baseline for qualitative evaluation. In general, both approaches tend to produce overly smooth results: the reconstructed images lack fine details and rich textures when compared with the ground-truths. Although our baseline outperforms existing approaches on PSNR, its visual quality is still far from satisfactory.

4.2 Favorability Evaluations

In this section we evaluate the human favorability of the recovered images. Unfortunately, there is currently no objective metric that measures favorability, so we conduct qualitative experiments for evaluation. As discussed above, the many-to-one quantization step makes recovery highly ill-posed, and the ambiguity becomes more extreme as the input quality decreases: for low qualities, most high-frequency information of the original image is eliminated. In the following experiments we therefore focus on a low JPEG quality, which demands more effective recovery of details. We train our one-to-many network as described in Section 3.4.

The results of our one-to-many network for Image 6046 and Image 344010 are shown in the last column of Fig. 17. Comparing them with the results of existing approaches and our baseline, we can see that our one-to-many network does a good job of recovering edges and details. The textures are much richer in our approach, such as the fur and the rocks in Image 6046 and the bushes in Image 344010. Note that our one-to-many network does not add details indiscriminately. For example, in Image 6046, our approach largely enriches the fur of the donkey, but the background sky remains clean, suggesting that the one-to-many network is aware of image semantics.

We also perform evaluations on the Set14 dataset [47]. Here, for each compressed image, we sample two different auxiliary variables $Z$ to obtain two reconstructed candidates, shown in the last two columns of Fig. 28. As can be seen, these candidates have different details, and both look more vivid than the results of our baseline. Interestingly, from the perspective of PSNR, in both Fig. 17 and Fig. 28 our one-to-many network performs even worse than the JPEG-compressed input images, whereas our results are obviously more visually pleasing. This again demonstrates that PSNR is not a sufficient quality measure.

(a) Ground-truth  (b) JPEG  (c) Baseline  (d) One-to-Many (1)  (e) One-to-Many (2)
(f) Ground-truth  (g) JPEG  (h) Baseline  (i) One-to-Many (1)  (j) One-to-Many (2)
Figure 28: Comparisons on Set14. Row 1: Image "baboon"; Row 2: Image "coastguard". Best viewed on screen.
(f) Ground-truth  (g) JPEG  (h) DDCN  (i) Baseline  (j) One-to-Many
Figure 39: Comparisons on BSDS500. Row 1: Image 100039; Row 2: Image 108004. Best viewed on screen.

4.3 Extension to Color Images

Our approach is not limited to grayscale images. We re-train our one-to-many network on RGB images and show the results in Fig. 39. As can be observed, our approach produces much finer details than DDCN and our baseline.

4.4 Further Analysis

Up-sampling

We conduct another experiment to examine the proposed shift-and-average strategy. Comparing Fig. 43(a) and Fig. 43(b), we observe that without the proposed strategy, grid-like artifacts are visible over the whole image, confirming that they result from the combination of a traditional deconvolution operation and a highly non-convex loss function. The proposed strategy suppresses such artifacts without hurting the perceptual quality.

Figure 43: (a) Trained using the perceptual loss, without the shift-and-average strategy; (b) trained using the perceptual loss, with the shift-and-average strategy; (c) trained using the perceptual and naturalness losses.
Loss

Fig. 43 also shows the influence of each loss. Fig. 43(b) shows that using the perceptual loss alone for measurement already recovers the primary semantic information. After adding the naturalness loss, fine details are supplemented, as indicated in Fig. 43(c). However, the contrast of the resulting image differs significantly from the input and the ground-truth (both can be found in Fig. 17). Once we add the JPEG loss, the contrast is corrected (see Row 1, Column 5 of Fig. 17). Interestingly, additional details seem to appear as well. One explanation is that high-contrast natural images usually have more complex textures than low-contrast images, so the naturalness loss encourages the network to synthesize more details once the contrast has been restored by the JPEG loss. These experiments demonstrate the importance of all three proposed losses.

5 Conclusion

In this paper, we systematically studied how to effectively recover artifact-free images from JPEG-compressed images. As a natural inversion of the many-to-one JPEG compression, we proposed a one-to-many network. The proposed model, optimized with a perceptual loss, a naturalness loss, and a JPEG loss, reconstructs multiple artifact-free candidates that are more favored by humans, and thus drastically improves the recovery quality.

One limitation of our approach is the scalability to various JPEG qualities: currently we need to train an individual model for each quality. Besides, how to objectively evaluate the favorability of output images remains an open problem. We hope to address these issues in the future.

Acknowledgment

This work is supported by the NSF of China under Grants U1611461 and 61672548, and by the Guangzhou Science and Technology Program, China, under Grant 201510010165.

References