## 1 Introduction

### 1.1 Motivation

Signal degradation refers to the corruption of a signal by causes such as interference or the mixing of the signal of interest with unwanted signals or noise; it is observed ubiquitously in practical information systems. The cause may be physical, such as imperfect data acquisition devices and noise in the transmission medium, or artificial, such as lossy data compression and the simultaneous transmission of multiple sources over the same medium. In addition, when we want to enhance a signal, we may assume that the signal has been somehow “degraded”; for example, to enhance the resolution of an image, we assume the image is a degraded version of an ideal “original” image that has high resolution [1].

To tackle signal degradation or to achieve signal enhancement, computational restoration of degraded signals has been investigated for many years. Different restoration tasks correspond to different causes of degradation. Taking images as an example, image denoising [2], image deblurring [3], single image super-resolution [1], image contrast enhancement [4], and image compression artifact removal [5][6], among others, all belong to image restoration tasks.

Different restoration tasks have different objectives. Some tasks aim to recover the “original” signal as faithfully as possible: image denoising seeks to recover the noise-free image, and compression artifact removal seeks to recover the uncompressed image. Other tasks are more concerned with the perceptual quality of the restored signal: image super-resolution produces image details so that the enhanced image looks “high-resolution,” and image inpainting generates a complete image that looks “natural.” Yet other tasks serve recognition or understanding. For example, an image containing a car license plate may be blurred, and image deblurring can produce a less blurred image in which the license plate can be recognized [7]; as another example, an image taken at night is difficult to interpret, and image contrast enhancement can produce a more natural-looking image that is better understood [8]. Recent years have witnessed growing effort on this last category [9, 10].

Given the different objectives, it is apparent that a signal restoration method designed for one specific task shall be evaluated with the specific metric that corresponds to the task’s objective. Indeed, the aforementioned objectives correspond to three groups of evaluation metrics:

1. Signal fidelity metrics, which evaluate how similar the restored signal is to the “original” signal. These include all the full-reference quality metrics, such as the well-known mean-squared-error (MSE) and its counterpart peak signal-to-noise ratio (PSNR), the structural similarity (SSIM) [11], and the difference between features extracted from the original and restored signals [12], to name a few.

2. Perceptual naturalness metrics, which evaluate how “natural” the restored signal is with respect to human perception. Perceptual naturalness has been evaluated by human observers and approximated by no-reference quality assessment methods [13, 14]. Recently, the popularity of generative adversarial networks (GANs) has motivated a formulation of perceptual naturalness [15].

3. Semantic quality metrics, which evaluate how “useful” the restored signal is, in the sense of how well it serves subsequent semantic-related analyses. For example, how well a classifier performs on the restored signal is a measure of the semantic quality. There are only a few studies on semantic quality assessment methods [16].

It is worth noting that signal fidelity metrics have dominated research on signal restoration. However, is a method optimized for signal fidelity also optimal for perceptual naturalness or semantic quality? This question was overlooked for a long time. Blau and Michaeli recently considered signal fidelity and perceptual naturalness and concluded that the two metrics cannot be optimized simultaneously [15]. Indeed, they provided a rigorous proof of the existence of the perception-distortion tradeoff: with distortion representing signal fidelity and perceptual difference representing perceptual naturalness, a signal restoration method cannot achieve both low distortion and low perceptual difference (up to a bound). This conclusion reveals a fundamental limit on the capability of signal restoration and has inspired the adoption of perceptual naturalness metrics in related tasks [17, 18].

Following the work on the perception-distortion tradeoff, in this paper we consider the three groups of metrics jointly, i.e. we study the relation between signal fidelity, perceptual naturalness, and semantic quality. We take the classification error rate as the representative of semantic quality, because classification is the most fundamental semantic-related analysis. We find that there is indeed a tradeoff between the three metrics, which we name the classification-distortion-perception (CDP) tradeoff. In short, the CDP tradeoff states that the distortion, the perceptual difference, and the classification error rate cannot all be minimized simultaneously. Our proof indicates the essential difference between the three quality metrics. In practice, it suggests adopting semantic quality metrics, rather than signal fidelity or perceptual naturalness metrics, when a signal restoration method is meant to serve recognition.

### 1.2 Problem Definition

Consider the process: $X \to Y \to \hat{X}$, where $X$ denotes the ideal “original” signal, $Y$ denotes the degraded signal, and $\hat{X}$ denotes the restored signal. We formulate $X$, $Y$, and $\hat{X}$ each as a discrete random variable. The cases of continuous random variables can be deduced in a similar manner, and thus are omitted hereafter. The probability mass function of $X$ is denoted by $p_X(x)$. The degradation model is denoted by $P_{Y|X}$, which is characterized by a conditional mass function $p(y|x)$. The restoration method is then denoted by $P_{\hat{X}|Y}$ and characterized by $p(\hat{x}|y)$.

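We are interested in classifying the signal into two categories in this paper. Thus, we assume each sample of the original signal belongs to one of two classes: $\omega_1$ or $\omega_2$. The a priori probabilities and the conditional mass functions are assumed to be known as $P_1, P_2$ and $p_{X_1}(x), p_{X_2}(x)$, respectively. In other words, $X$ follows a two-component mixture model: $p_X(x) = P_1 p_{X_1}(x) + P_2 p_{X_2}(x)$. Accordingly, $Y$ follows the model: $p_Y(y) = P_1 p_{Y_1}(y) + P_2 p_{Y_2}(y)$, and $\hat{X}$ follows the model: $p_{\hat{X}}(\hat{x}) = P_1 p_{\hat{X}_1}(\hat{x}) + P_2 p_{\hat{X}_2}(\hat{x})$, where

$$p_{Y_i}(y) = \sum_{x\in\mathcal{X}} p(y|x)\, p_{X_i}(x), \quad i=1,2 \tag{1}$$

$$\begin{aligned} p_{\hat{X}_i}(\hat{x}) &= \sum_{y\in\mathcal{Y}} p(\hat{x}|y)\, p_{Y_i}(y) \\ &= \sum_{y}\sum_{x} p(\hat{x}|y)\, p(y|x)\, p_{X_i}(x), \quad i=1,2 \end{aligned} \tag{2}$$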

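A binary classifier can be denoted by

$$c(t) = c(t|\mathcal{R}) = \begin{cases} \omega_1, & \text{if } t\in\mathcal{R} \\ \omega_2, & \text{otherwise} \end{cases} \tag{3}$$

If we apply this classifier to the original signal $X$, we obtain the error rate

$$\varepsilon(X|c) = \varepsilon(X|\mathcal{R}) = P_2 \sum_{x\in\mathcal{R}} p_{X_2}(x) + P_1 \sum_{x\notin\mathcal{R}} p_{X_1}(x) \tag{4}$$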

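The optimal classifier is defined as the classifier that achieves the minimal error rate for a given signal, e.g. $c^*_X = \arg\min_c \varepsilon(X|c)$. According to the Bayes decision rule (see [19] for proof), the optimal classifier is

$$c^*_X = c(\cdot|\mathcal{R}^*_X), \quad \text{where } \mathcal{R}^*_X = \{x \mid P_1 p_{X_1}(x) \ge P_2 p_{X_2}(x)\} \tag{5}$$

which leads to the minimal error rate, a.k.a. the Bayes error rate

$$\begin{aligned} \epsilon(X) &= \min_c \varepsilon(X|c) = \varepsilon(X|\mathcal{R}^*_X) \\ &= \sum_x \min\left[P_1 p_{X_1}(x),\, P_2 p_{X_2}(x)\right] \\ &= \frac{1}{2} - \frac{1}{2}\sum_x \left|P_1 p_{X_1}(x) - P_2 p_{X_2}(x)\right| \end{aligned} \tag{6}$$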
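The Bayes error rate can be checked numerically on a small alphabet. The following is a minimal sketch, assuming a three-symbol alphabet with made-up priors and class-conditional mass functions (none of these numbers come from the paper). It evaluates the error rate of the Bayes decision region, the two closed forms of the Bayes error rate, and a brute-force minimum over all decision regions; the function names are illustrative only.

```python
from itertools import combinations

def error_rate(priors, cond, region):
    """Error rate of the classifier that decides class 1 exactly on `region`."""
    (P1, P2), (p1, p2) = priors, cond
    return sum(P2 * p2[x] if x in region else P1 * p1[x] for x in range(len(p1)))

def bayes_region(priors, cond):
    """Decision region of the Bayes classifier: where P1*p1(x) >= P2*p2(x)."""
    (P1, P2), (p1, p2) = priors, cond
    return {x for x in range(len(p1)) if P1 * p1[x] >= P2 * p2[x]}

def bayes_error_min(priors, cond):
    """Bayes error rate as a sum of pointwise minima."""
    (P1, P2), (p1, p2) = priors, cond
    return sum(min(P1 * a, P2 * b) for a, b in zip(p1, p2))

def bayes_error_abs(priors, cond):
    """Bayes error rate in the 1/2 - 1/2 * sum |.| form."""
    (P1, P2), (p1, p2) = priors, cond
    return 0.5 - 0.5 * sum(abs(P1 * a - P2 * b) for a, b in zip(p1, p2))

# Toy values (assumptions for illustration only).
priors = (0.6, 0.4)
cond = ([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])

e_region = error_rate(priors, cond, bayes_region(priors, cond))
e_min = bayes_error_min(priors, cond)
e_abs = bayes_error_abs(priors, cond)

# Brute force over all 2^3 decision regions confirms that none does better.
all_regions = [set(c) for r in range(4) for c in combinations(range(3), r)]
e_brute = min(error_rate(priors, cond, R) for R in all_regions)

print(e_region, e_min, e_abs, e_brute)  # all four agree, about 0.18 here
```

The two closed forms agree because $\min(a,b) = \frac{1}{2}(a+b) - \frac{1}{2}|a-b|$ and the terms $P_1 p_{X_1}(x) + P_2 p_{X_2}(x)$ sum to one.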

### 1.3 Main Theorems

We prove two versions of the CDP tradeoff. For the first version, we consider using a predefined classifier on the restored signal. This leads to

###### Definition 1.

The classification-distortion-perception (CDP) function is

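$$C(D,P) = \min_{P_{\hat{X}|Y}} \varepsilon(\hat{X}|c_0), \quad \text{subject to } \mathbb{E}[\Delta(X,\hat{X})] \le D,\ d(p_X, p_{\hat{X}}) \le P \tag{7}$$

where $c_0$ is the predefined classifier, $\mathbb{E}$ denotes expectation, $\Delta(\cdot,\cdot)$ is a function that measures the distortion between the original and the restored signals, and $d(\cdot,\cdot)$ is a function that measures the difference between two probability mass functions, which is claimed to be indicative of perceptual difference [15].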

###### Theorem 1.

Consider (7), if $d(\cdot,\cdot)$ is convex in its second argument, then $C(D,P)$ is

1. monotonically non-increasing,

2. convex in $D$ and $P$.

Note that the convexity of the perceptual difference is assumed; this assumption is reported to hold for a large number of commonly used difference functions, including any f-divergence (e.g. Kullback-Leibler divergence, total variation, Hellinger) and the Rényi divergence [20, 21].
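To make Definition 1 concrete, the sketch below approximates the CDP function by grid search on a binary toy problem. Everything specific here is an assumption for illustration and not taken from the paper: the priors and class-conditional mass functions, a binary symmetric channel as the degradation, Hamming distortion for $\Delta$, total variation for $d$, and a fixed classifier that decides $\omega_1$ when the restored symbol is 0. The restoration method is parameterized by `a` $= p(\hat{x}=0|y=0)$ and `b` $= p(\hat{x}=0|y=1)$. Because only grid points are searched, the result is an upper approximation of the true minimum, and infeasible constraints return infinity.

```python
from itertools import product

PRIORS = (0.5, 0.5)                    # toy class priors
PX = ((0.95, 0.05), (0.6, 0.4))        # toy class-conditional pmfs on {0, 1}
FLIP = 0.3                             # binary symmetric channel p(y != x | x)
R0 = {0}                               # fixed classifier decides class 1 on R0

def channel(y, x):
    return 1 - FLIP if y == x else FLIP

def evaluate(a, b):
    """Return (error, distortion, perception) for the restoration (a, b)."""
    r = {(0, 0): a, (1, 0): 1 - a, (0, 1): b, (1, 1): 1 - b}  # r[(xhat, y)]
    # End-to-end kernel p(xhat | x), marginalizing over y.
    k = {(xh, x): sum(r[(xh, y)] * channel(y, x) for y in (0, 1))
         for xh, x in product((0, 1), repeat=2)}
    p_x = [PRIORS[0] * PX[0][x] + PRIORS[1] * PX[1][x] for x in (0, 1)]
    p_xh_i = [[sum(k[(xh, x)] * PX[i][x] for x in (0, 1)) for xh in (0, 1)]
              for i in (0, 1)]
    p_xh = [PRIORS[0] * p_xh_i[0][t] + PRIORS[1] * p_xh_i[1][t] for t in (0, 1)]
    error = sum(PRIORS[1] * p_xh_i[1][t] if t in R0 else PRIORS[0] * p_xh_i[0][t]
                for t in (0, 1))
    distortion = sum(p_x[x] * k[(xh, x)]
                     for xh, x in product((0, 1), repeat=2) if xh != x)
    perception = 0.5 * sum(abs(p_x[t] - p_xh[t]) for t in (0, 1))
    return error, distortion, perception

STEPS = 20
GRID = [i / STEPS for i in range(STEPS + 1)]
POINTS = [evaluate(a, b) for a, b in product(GRID, repeat=2)]

def cdp(d_max, p_max, tol=1e-9):
    """Grid approximation of C(D, P); infinity if no grid point is feasible."""
    feasible = [e for e, d, p in POINTS if d <= d_max + tol and p <= p_max + tol]
    return min(feasible) if feasible else float("inf")

D_LEVELS = [0.2, 0.225, 0.25, 0.3, 0.4]
P_LEVELS = [0.05, 0.1, 0.2, 0.3]
table = {(D, P): cdp(D, P) for D in D_LEVELS for P in P_LEVELS}

# Relaxing either bound never increases the minimal error rate.
monotone = all(table[(D1, P1)] >= table[(D2, P2)] - 1e-12
               for D1, P1 in table for D2, P2 in table
               if D1 <= D2 and P1 <= P2)
print(monotone, cdp(0.225, 1.0), cdp(0.3, 1.0))
```

In this toy setting the restoration with the lowest distortion always outputs 0, which leaves the fixed classifier at chance level; allowing more distortion lets the restoration pass the degraded symbol through and lowers the error rate. This is the kind of tension that the CDP function describes.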

For the second version, we consider using the optimal classifier on the restored signal, i.e. the classifier is adaptive to the restored signal. According to the Bayes decision rule, we are actually considering the Bayes error rate of $\hat{X}$. This leads to

###### Definition 2.

The strong classification-distortion-perception (SCDP) function is

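$$C_S(D,P) = \min_{P_{\hat{X}|Y}} \epsilon(\hat{X}), \quad \text{subject to } \mathbb{E}[\Delta(X,\hat{X})] \le D,\ d(p_X, p_{\hat{X}}) \le P \tag{8}$$

###### Theorem 2.

Consider (8), $C_S(D,P)$ is monotonically non-increasing.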

### 1.4 Paper Organization

In the following sections, we first give some properties of the classification error rate, especially the Bayes error rate, which will be helpful in our proofs of the main theorems. Then we prove the two theorems one by one. Discussion and conclusion are finally presented.

## 2 Properties of the Classification Error Rate

### 2.1 Classification Error Rate is Linear

###### Theorem 3.

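Let $U$ follow a two-component mixture model: $p_U(u) = P_1 p_{U_1}(u) + P_2 p_{U_2}(u)$, and similarly let $V$ follow: $p_V(v) = P_1 p_{V_1}(v) + P_2 p_{V_2}(v)$. Let $W$ be the random variable with $p_{W_i}(w) = \lambda p_{U_i}(w) + (1-\lambda) p_{V_i}(w)$, $i=1,2$, where $0 \le \lambda \le 1$. Let $c_0$ be a fixed classifier, then

$$\varepsilon(W|c_0) = \lambda\,\varepsilon(U|c_0) + (1-\lambda)\,\varepsilon(V|c_0) \tag{9}$$

###### Proof.

As $c_0$ is a fixed classifier, it can be denoted in general by $c(\cdot|\mathcal{R}_0)$. Then we have

$$\varepsilon(U|c_0) = P_2 \sum_{u\in\mathcal{R}_0} p_{U_2}(u) + P_1 \sum_{u\notin\mathcal{R}_0} p_{U_1}(u) \tag{10}$$

$$\varepsilon(V|c_0) = P_2 \sum_{v\in\mathcal{R}_0} p_{V_2}(v) + P_1 \sum_{v\notin\mathcal{R}_0} p_{V_1}(v) \tag{11}$$

Thus

$$\begin{aligned} \varepsilon(W|c_0) &= P_2 \sum_{w\in\mathcal{R}_0} p_{W_2}(w) + P_1 \sum_{w\notin\mathcal{R}_0} p_{W_1}(w) \\ &= P_2 \sum_{w\in\mathcal{R}_0} \left[\lambda p_{U_2}(w) + (1-\lambda) p_{V_2}(w)\right] + P_1 \sum_{w\notin\mathcal{R}_0} \left[\lambda p_{U_1}(w) + (1-\lambda) p_{V_1}(w)\right] \\ &= \lambda \left[P_2 \sum_{w\in\mathcal{R}_0} p_{U_2}(w) + P_1 \sum_{w\notin\mathcal{R}_0} p_{U_1}(w)\right] + (1-\lambda) \left[P_2 \sum_{w\in\mathcal{R}_0} p_{V_2}(w) + P_1 \sum_{w\notin\mathcal{R}_0} p_{V_1}(w)\right] \\ &= \lambda\,\varepsilon(U|c_0) + (1-\lambda)\,\varepsilon(V|c_0) \end{aligned} \tag{12}$$

∎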

### 2.2 Bayes Error Rate is Concave

###### Theorem 4.

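Let $U$, $V$, and $W$ be defined as in Theorem 3, then

$$\epsilon(W) \ge \lambda\,\epsilon(U) + (1-\lambda)\,\epsilon(V) \tag{13}$$

###### Proof.

Let $c^*_W$ denote the optimal classifier for $W$, then $\epsilon(W) = \varepsilon(W|c^*_W)$. According to (9) we have $\varepsilon(W|c^*_W) = \lambda\,\varepsilon(U|c^*_W) + (1-\lambda)\,\varepsilon(V|c^*_W)$. Note that $\varepsilon(U|c^*_W) \ge \epsilon(U)$ and $\varepsilon(V|c^*_W) \ge \epsilon(V)$, because the Bayes error rate is the minimum over all classifiers. Thus $\epsilon(W) \ge \lambda\,\epsilon(U) + (1-\lambda)\,\epsilon(V)$. ∎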
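Theorems 3 and 4 can be sanity-checked on random instances. The sketch below is a numerical illustration, not a proof. It assumes a five-symbol alphabet, arbitrary priors, and an arbitrary fixed decision region; it draws random class-conditional mass functions for $U$ and $V$, mixes them class by class with a random $\lambda$, and records the largest deviation from linearity (fixed classifier) and the smallest concavity gap (Bayes error rate). All names and values are hypothetical.

```python
import random

def error_fixed(priors, cond, region):
    """Error rate of the fixed classifier that decides class 1 on `region`."""
    (P1, P2), (p1, p2) = priors, cond
    return sum(P2 * p2[t] if t in region else P1 * p1[t] for t in range(len(p1)))

def bayes_error(priors, cond):
    """Bayes error rate as a sum of pointwise minima."""
    (P1, P2), (p1, p2) = priors, cond
    return sum(min(P1 * a, P2 * b) for a, b in zip(p1, p2))

def random_pmf(n, rng):
    w = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

def mix(lam, cond_u, cond_v):
    """Class-wise mixture: lam * p_{U_i} + (1 - lam) * p_{V_i} for i = 1, 2."""
    return tuple([lam * a + (1 - lam) * b for a, b in zip(pu, pv)]
                 for pu, pv in zip(cond_u, cond_v))

rng = random.Random(0)
n = 5
priors = (0.3, 0.7)        # toy priors (assumption)
region0 = {0, 2}           # arbitrary fixed decision region (assumption)

max_linear_gap = 0.0            # should stay at floating-point noise level
min_concave_gap = float("inf")  # should never be meaningfully negative
for _ in range(1000):
    cu = (random_pmf(n, rng), random_pmf(n, rng))
    cv = (random_pmf(n, rng), random_pmf(n, rng))
    lam = rng.random()
    cw = mix(lam, cu, cv)
    lin = error_fixed(priors, cw, region0) - (
        lam * error_fixed(priors, cu, region0)
        + (1 - lam) * error_fixed(priors, cv, region0))
    max_linear_gap = max(max_linear_gap, abs(lin))
    gap = bayes_error(priors, cw) - (
        lam * bayes_error(priors, cu) + (1 - lam) * bayes_error(priors, cv))
    min_concave_gap = min(min_concave_gap, gap)

print(max_linear_gap, min_concave_gap)
```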

### 2.3 Bayes Error Rate is Non-Decreasing

###### Theorem 5.

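Let the process of $X \to Y$ be denoted by $P_{Y|X}$, which is characterized by a conditional mass function $p(y|x)$, then $\epsilon(Y) \ge \epsilon(X)$, and $\epsilon(Y) = \epsilon(X)$ if and only if $P_{Y|X}$ satisfies: for every $y$, $\sum_{x\in\mathcal{R}^+_X} p(y|x) \cdot \sum_{x\in\mathcal{R}^-_X} p(y|x) = 0$, where $\mathcal{R}^+_X = \{x \mid P_1 p_{X_1}(x) > P_2 p_{X_2}(x)\}$, and $\mathcal{R}^-_X = \{x \mid P_1 p_{X_1}(x) < P_2 p_{X_2}(x)\}$. Note that $\mathcal{R}^+_X$ is slightly different from $\mathcal{R}^*_X$ defined in (5).

###### Proof.

$$\begin{aligned} \epsilon(Y) &= \sum_y \min\left[P_1 p_{Y_1}(y),\, P_2 p_{Y_2}(y)\right] \\ &= \frac{1}{2} - \frac{1}{2}\sum_y \left|P_1 p_{Y_1}(y) - P_2 p_{Y_2}(y)\right| \\ &= \frac{1}{2} - \frac{1}{2}\sum_y \left|P_1 \sum_x p(y|x)\, p_{X_1}(x) - P_2 \sum_x p(y|x)\, p_{X_2}(x)\right| \\ &= \frac{1}{2} - \frac{1}{2}\sum_y \left|\sum_x p(y|x)\left[P_1 p_{X_1}(x) - P_2 p_{X_2}(x)\right]\right| \\ &\ge \frac{1}{2} - \frac{1}{2}\sum_y \sum_x p(y|x) \left|P_1 p_{X_1}(x) - P_2 p_{X_2}(x)\right| \\ &= \frac{1}{2} - \frac{1}{2}\sum_x \left|P_1 p_{X_1}(x) - P_2 p_{X_2}(x)\right| \sum_y p(y|x) \\ &= \frac{1}{2} - \frac{1}{2}\sum_x \left|P_1 p_{X_1}(x) - P_2 p_{X_2}(x)\right| = \epsilon(X) \end{aligned} \tag{14}$$

When $\epsilon(Y) = \epsilon(X)$, for any $y$, we need to have

$$\left|\sum_x p(y|x)\left[P_1 p_{X_1}(x) - P_2 p_{X_2}(x)\right]\right| = \sum_x p(y|x) \left|P_1 p_{X_1}(x) - P_2 p_{X_2}(x)\right| \tag{15}$$

which is equivalent to: all the $x$'s that satisfy $p(y|x) > 0$ shall have either $P_1 p_{X_1}(x) \ge P_2 p_{X_2}(x)$ or $P_1 p_{X_1}(x) \le P_2 p_{X_2}(x)$. The condition is further equivalent to: the $x$'s that satisfy $p(y|x) > 0$ shall be either all in $\mathcal{R}^+_X \cup \mathcal{R}^0_X$, or all in $\mathcal{R}^-_X \cup \mathcal{R}^0_X$, where $\mathcal{R}^0_X = \{x \mid P_1 p_{X_1}(x) = P_2 p_{X_2}(x)\}$. In other words, $\sum_{x\in\mathcal{R}^+_X} p(y|x) \cdot \sum_{x\in\mathcal{R}^-_X} p(y|x) = 0$. ∎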

We can compare Theorem 5 with the data processing theorem in information theory: consider the process of $X \to Y$ as a deterministic function $Y = f(X)$, then $H(Y) \le H(X)$, and $H(Y) = H(X)$ if and only if $f$ is invertible [22]. That is, the amount of information we have about the source is non-increasing after data processing. Similarly, Theorem 5 claims that the Bayes error rate is non-decreasing after data processing, because processing can only lose information or, at best, preserve it. Moreover, the condition required in Theorem 5 is satisfied not only by invertible functions, but also by a large group of non-invertible functions as well as probabilistic mappings, which is quite different from the data processing theorem. In other words, we may lose information, but that information loss may not affect classification.
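The contrast drawn above can be illustrated with a small numerical sketch. The values are toy assumptions, not from the paper: a four-symbol source in which symbols 0 and 1 favour class 1 and symbols 2 and 3 favour class 2. One non-invertible mapping merges symbols only within the same side of the Bayes decision and leaves the Bayes error rate unchanged; another merges symbols across the two sides and increases it. Random probabilistic mappings are also checked against the inequality of Theorem 5.

```python
import random

def bayes_error(priors, cond):
    """Bayes error rate as a sum of pointwise minima."""
    (P1, P2), (p1, p2) = priors, cond
    return sum(min(P1 * a, P2 * b) for a, b in zip(p1, p2))

def push(cond, kernel):
    """Class-conditional pmfs of Y from those of X, with kernel[x][y] = p(y|x)."""
    ny = len(kernel[0])
    return tuple([sum(kernel[x][y] * p[x] for x in range(len(p))) for y in range(ny)]
                 for p in cond)

def random_kernel(nx, ny, rng):
    rows = []
    for _ in range(nx):
        w = [rng.random() + 1e-9 for _ in range(ny)]
        s = sum(w)
        rows.append([v / s for v in w])
    return rows

# Toy source (assumption): x in {0, 1} favours class 1, x in {2, 3} favours class 2.
priors = (0.5, 0.5)
cond_x = ([0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4])

# Non-invertible but harmless: merges 0 with 1 and 2 with 3.
merge_same_side = [[1, 0], [1, 0], [0, 1], [0, 1]]
# Non-invertible and harmful: merges 0 with 2 and 1 with 3.
merge_across = [[1, 0], [0, 1], [1, 0], [0, 1]]

e_x = bayes_error(priors, cond_x)
e_same = bayes_error(priors, push(cond_x, merge_same_side))
e_across = bayes_error(priors, push(cond_x, merge_across))

rng = random.Random(1)
never_decreases = all(
    bayes_error(priors, push(cond_x, random_kernel(4, 3, rng))) >= e_x - 1e-12
    for _ in range(1000))

print(e_x, e_same, e_across, never_decreases)
```

With these numbers the source and the same-side merge both give a Bayes error rate of about 0.3, while the cross-side merge gives about 0.4.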

## 3 Proof of the CDP Tradeoff (Theorem 1)

###### Proof.

For the first point, simply note that when increasing $D$ or $P$, the feasible domain of $P_{\hat{X}|Y}$ is enlarged; as $C(D,P)$ is the minimal value of $\varepsilon(\hat{X}|c_0)$ over the feasible domain, and the feasible domain is enlarged, the minimal value will not increase.

For the second point, it is equivalent to prove:

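$$\lambda C(D_1,P_1) + (1-\lambda) C(D_2,P_2) \ge C(\lambda D_1 + (1-\lambda) D_2,\ \lambda P_1 + (1-\lambda) P_2) \tag{16}$$

for any $\lambda \in [0,1]$. First, let $P_{\hat{X}_\mu|Y}$ (resp. $P_{\hat{X}_\nu|Y}$) denote the optimal restoration method under constraint $(D_1,P_1)$ (resp. $(D_2,P_2)$), and $\hat{X}_\mu$ (resp. $\hat{X}_\nu$) be the restored signal, i.e.

$$\varepsilon(\hat{X}_\mu|c_0) = \min_{P_{\hat{X}|Y}} \varepsilon(\hat{X}|c_0), \quad \text{subject to } \mathbb{E}[\Delta(X,\hat{X})] \le D_1,\ d(p_X, p_{\hat{X}}) \le P_1 \tag{17}$$

$$\varepsilon(\hat{X}_\nu|c_0) = \min_{P_{\hat{X}|Y}} \varepsilon(\hat{X}|c_0), \quad \text{subject to } \mathbb{E}[\Delta(X,\hat{X})] \le D_2,\ d(p_X, p_{\hat{X}}) \le P_2 \tag{18}$$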

Then the left hand side of (16) becomes

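$$\lambda\,\varepsilon(\hat{X}_\mu|c_0) + (1-\lambda)\,\varepsilon(\hat{X}_\nu|c_0) = \varepsilon(\hat{X}_\lambda|c_0) \tag{19}$$

where we have used Theorem 3 and $\hat{X}_\lambda$ denotes the restored signal corresponding to $P_{\hat{X}_\lambda|Y} = \lambda P_{\hat{X}_\mu|Y} + (1-\lambda) P_{\hat{X}_\nu|Y}$. Let $D_\lambda = \mathbb{E}[\Delta(X,\hat{X}_\lambda)]$, $P_\lambda = d(p_X, p_{\hat{X}_\lambda})$, then by definition

$$\varepsilon(\hat{X}_\lambda|c_0) \ge C(D_\lambda, P_\lambda) \tag{20}$$

Next, as $d(\cdot,\cdot)$ in (7) is convex in its second argument, we have

$$\begin{aligned} P_\lambda &= d(p_X,\ \lambda p_{\hat{X}_\mu} + (1-\lambda) p_{\hat{X}_\nu}) \\ &\le \lambda\, d(p_X, p_{\hat{X}_\mu}) + (1-\lambda)\, d(p_X, p_{\hat{X}_\nu}) \\ &\le \lambda P_1 + (1-\lambda) P_2 \end{aligned} \tag{21}$$

where the last inequality is due to (17) and (18). Similarly, we have

$$\begin{aligned} D_\lambda &= \mathbb{E}[\Delta(X,\hat{X}_\lambda)] \\ &= \mathbb{E}_Y \mathbb{E}[\Delta(X,\hat{X}_\lambda)|Y] \\ &= \mathbb{E}_Y\left[\lambda\, \mathbb{E}[\Delta(X,\hat{X}_\mu)|Y] + (1-\lambda)\, \mathbb{E}[\Delta(X,\hat{X}_\nu)|Y]\right] \\ &= \lambda\, \mathbb{E}[\Delta(X,\hat{X}_\mu)] + (1-\lambda)\, \mathbb{E}[\Delta(X,\hat{X}_\nu)] \\ &\le \lambda D_1 + (1-\lambda) D_2 \end{aligned} \tag{22}$$

where the last inequality is again due to (17) and (18). Finally, note that $C(D,P)$ is non-increasing with respect to $D$ and $P$, so

$$C(D_\lambda, P_\lambda) \ge C(\lambda D_1 + (1-\lambda) D_2,\ \lambda P_1 + (1-\lambda) P_2) \tag{23}$$

Combining (19), (20), and (23), we have (16). ∎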

## 4 Proof of the CDP Tradeoff (Theorem 2)

###### Proof.

Simply note that when increasing $D$ or $P$, the feasible domain of $P_{\hat{X}|Y}$ is enlarged; as $C_S(D,P)$ is the minimal value of $\epsilon(\hat{X})$ over the feasible domain, and the feasible domain is enlarged, the minimal value will not increase. ∎

## 5 Discussion and Conclusion

We would like to mention the difference between Theorem 1 and Theorem 2. Theorem 2 is more fundamental, as it deals with the theoretically minimal error rate of the restored signal. In practice, however, this error rate is not achievable if the degradation model is unknown. Clearly, if $P_{Y|X}$ is not available, we cannot draw any meaningful conclusion about the mass function of $\hat{X}$, which prohibits the search for the optimal restoration method together with the optimal classifier. From a practical perspective, we usually adopt a fixed classifier (for example, a classifier trained on samples of the original signal) and adjust the restoration method only. On the other hand, if the degradation model is known, then it is possible to consider the optimal classifier for the degraded signal directly: in theory it is better to consider $\epsilon(Y)$ instead of $\epsilon(\hat{X})$, because we have confirmed that $\epsilon(Y) \le \epsilon(\hat{X})$ (Theorem 5). In other words, signal restoration cannot improve the classification accuracy as long as the degradation model is known. According to these analyses, Theorem 2 is less appealing in practice.

Note that we do not prove the convexity of the strong CDP function, as we have done for the CDP function in Theorem 1. This is due to the essential difference between the classification error rate with a fixed classifier and the Bayes error rate: the former is linear and the latter is concave (Theorems 3 and 4). Note that the distortion is also linear, whereas the perceptual difference is convex. We suspect that the strong CDP function may not be convex, which is to be confirmed in the future.

Our findings can be useful especially for computer vision research in which low-level vision tasks (signal restoration) serve high-level vision tasks (visual understanding). If the degradation model is known, we recommend directly classifying the degraded signal without any restoration; in this case the classifier can be trained on samples simulated by applying the known degradation to the original signal. If the degradation model is unknown, we recommend using a fixed classifier, which can be trained, for example, on samples of the original signal; meanwhile, we recommend searching for the restoration method with the classification error rate as the objective (or one of the objectives) of optimization. This strategy is clearly different from previous works that optimize various kinds of distortion metrics in order to improve classification performance.

## References

• [1] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in ECCV, 2014, pp. 184–199.
• [2] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
• [3] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang, “Deep video deblurring for hand-held cameras,” in CVPR, 2017, pp. 1279–1288.
• [4] M. Gharbi, J. Chen, J. T. Barron, S. W. Hasinoff, and F. Durand, “Deep bilateral learning for real-time image enhancement,” ACM Transactions on Graphics, vol. 36, no. 4, p. 118, 2017.
• [5] C. Dong, Y. Deng, C. C. Loy, and X. Tang, “Compression artifacts reduction by a deep convolutional network,” in ICCV, 2015, pp. 576–584.
• [6] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Generative image inpainting with contextual attention,” in CVPR, 2018, pp. 5505–5514.
• [7] Q. Lu, W. Zhou, L. Fang, and H. Li, “Robust blur kernel estimation for license plate images from fast moving vehicles,” IEEE Transactions on Image Processing, vol. 25, no. 5, pp. 2311–2323, 2016.
• [8] H. Kuang, X. Zhang, Y.-J. Li, L. L. H. Chan, and H. Yan, “Nighttime vehicle detection based on bio-inspired image enhancement and weighted score-level feature fusion,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 4, pp. 927–936, 2017.
• [9] J. Shermeyer and A. Van Etten, “The effects of super-resolution on object detection performance in satellite imagery,” arXiv preprint arXiv:1812.04098, 2018.
• [10] R. G. VidalMata, S. Banerjee, B. RichardWebster et al., “Bridging the gap between computational photography and visual recognition,” arXiv preprint arXiv:1901.09482, 2019.
• [11] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
• [12] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, 2016, pp. 694–711.
• [13] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012.
• [14] M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the DCT domain,” IEEE Transactions on Image Processing, vol. 21, no. 8, pp. 3339–3352, 2012.
• [15] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in CVPR, 2018, pp. 6228–6237.
• [16] D. Liu, D. Wang, and H. Li, “Recognizable or not: Towards image semantic quality assessment for compression,” Sensing and Imaging, vol. 18, no. 1, pp. 1–20, 2017.
• [17] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor, “The 2018 PIRM challenge on perceptual image super-resolution,” in ECCV, 2018, pp. 1–22.
• [18] T. Vu, C. Van Nguyen, T. X. Pham, T. M. Luu, and C. D. Yoo, “Fast and efficient image quality enhancement via desubpixel convolutional neural networks,” in ECCV, 2018, pp. 1–17.
• [19] K. Fukunaga, Introduction to Statistical Pattern Recognition (2nd Edition). San Diego, CA, USA: Academic Press, 1990, ch. 3.1, pp. 51–65.
• [20] I. Csiszár and P. C. Shields, “Information theory and statistics: A tutorial,” Foundations and Trends® in Communications and Information Theory, vol. 1, no. 4, pp. 417–528, 2004.
• [21] T. Van Erven and P. Harremoës, “Rényi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
• [22] T. M. Cover and J. A. Thomas, Elements of Information Theory.   John Wiley & Sons, 2012.