A Formal Evaluation of PSNR as Quality Measurement Parameter for Image Segmentation Algorithms

by   Fernando A. Fardo, et al.

Quality evaluation of image segmentation algorithms is still a subject of debate and research. Currently, there is no generic metric that can be applied reliably to any algorithm. This article presents an evaluation of the PSNR (Peak Signal-to-Noise Ratio) as a metric, which has been used to evaluate threshold level selection as well as the number of thresholds in multi-level segmentation. The results obtained in this study suggest that the PSNR is not an adequate quality measurement for segmentation algorithms.




1 Introduction

In image processing, segmentation is a set of techniques that separate regions of a scene based on similarity. Several techniques are available for this process Rodrigues2011 ; Erdmann2015 . Segmentation is usually based on attributes such as color, brightness, contrast, or continuity of pixel regions. In the particular case of threshold-based techniques, one or more threshold values are determined. Pixels of similar brightness levels are then grouped as below or above such threshold levels gonzalez2002digital .

Fig. 1 shows an example of a scene containing a simple foreground and a background. Fig. 2 shows its corresponding gray level histogram with the detected threshold level. The resulting image of a threshold-based segmentation algorithm is shown in Fig. 3, where pixels below the threshold are set to 0. Conversely, pixels of brightness level above the threshold are set to 255. In this case, pixels labeled as 0 and 255 can be treated as the background and foreground, respectively.

Figure 1: Example of an image with foreground and background
Figure 2: Gray level histogram with detected threshold
Figure 3: Resulting image after threshold-based segmentation at the detected threshold
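The thresholding step described above can be sketched in a few lines of Python. This is only an illustration: the function name `threshold_image`, the sample pixel values, and the threshold of 100 are assumptions made for the example, and the choice of sending pixels exactly equal to the threshold to the background is also an assumption.

```python
def threshold_image(pixels, t, low=0, high=255):
    """Binarize a gray-level image given as a list of rows.

    Pixels brighter than t receive the `high` label (foreground) and all
    other pixels receive the `low` label (background). Assigning pixels
    equal to t to the background is an assumption of this sketch.
    """
    return [[high if p > t else low for p in row] for row in pixels]


# Hypothetical 3x3 gray-level image and threshold, for illustration only.
image = [
    [12, 40, 200],
    [35, 180, 220],
    [20, 90, 250],
]
mask = threshold_image(image, t=100)
print(mask)  # [[0, 0, 255], [0, 255, 255], [0, 0, 255]]
```

The labels 0 and 255 are parameters here because, as discussed later in the text, nothing forces a particular pair of label values.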

Such techniques are often used as a pre-processing step in high-level computer vision systems, as they reduce the amount of irrelevant information by grouping similar pixels into the same region. The objective of threshold algorithms is to detect the threshold level that most accurately separates an image into regions of interest. The main problem is that the quality evaluation of such algorithms lacks an objective parameter and cannot be determined automatically.

There are many proposals for a generic metric for segmentation algorithms. Such a metric is often difficult to describe, making an objective evaluation method potentially unreliable. The evaluation methods can be divided into two main categories: analytic and empirical cardoso2005toward . The analytic methods are based on properties obtained from the segmented image, which can be used to obtain a quantitative quality measurement. These methods are not very reliable, as determining the quality of a segmentation based purely on analytic parameters can be difficult cardoso2005toward . The empirical methods are based on the comparison of the resulting segmented image with pre-defined desirable results determined by human operators, and can be further divided into two subcategories: goodness methods and discrepancy methods. Goodness methods use pre-established parameters such as region uniformity or inter-region contrast. Discrepancy methods rely on the comparison of the segmentation result with a reference image known as ground truth, which is established by a human operator cardoso2005toward .

Despite its limitations, the PSNR has been used as an analytic metric by several authors of threshold-based algorithms chen2011gray ; horng2011multilevel ; arora2008multilevel . As the subject of this study, we performed experiments to verify whether the PSNR can be used reliably as an analytic metric for image segmentation.

2 PSNR

The PSNR is a signal processing measurement that compares a received or processed signal to its original source signal. This comparison quantifies how faithful a processed signal is to the original, and also allows us to identify possible noise or distortion in the signal. We can say that the PSNR represents a direct relationship between a signal before and after a degradation process.

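Mathematically, the PSNR is described by Equations (1) and (2):

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{MAX^2}{\mathrm{MSE}}\right) \tag{1}$$

$$\mathrm{MSE} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left[I(i,j) - K(i,j)\right]^2 \tag{2}$$

where $MAX$ is the highest possible value of the signal, $I$ is the original image of $M \times N$ pixels, and $K$ is the processed image. In the case of a gray scale image of 8 bits, $MAX = 255$. As shown in Eq. (1), the PSNR decreases as the MSE (Mean Squared Error) increases. The final value of the PSNR is given in decibels.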

The PSNR is generally used to evaluate the quality of transmission and compression of image or video signals, based on the mean squared error of the received or processed image in comparison to the source image. However, it has also been used as an analytic metric for segmentation algorithm evaluation chen2011gray ; horng2011multilevel . In the case of multi-threshold algorithms, it has also been used as a metric to determine the number of thresholds arora2008multilevel as well as their values yun2011multi .
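As a minimal sketch, the standard PSNR definition can be computed between an original gray-level image and a segmentation mask as follows. The function names `mse` and `psnr` and the small sample arrays are assumptions made for illustration; images are represented as plain lists of rows so that the example needs only the standard library.

```python
import math


def mse(original, processed):
    """Mean squared error between two equally sized images (lists of rows)."""
    total = 0
    count = 0
    for row_o, row_p in zip(original, processed):
        for o, p in zip(row_o, row_p):
            total += (o - p) ** 2
            count += 1
    return total / count


def psnr(original, processed, max_value=255):
    """PSNR in decibels; infinite when the two images are identical."""
    error = mse(original, processed)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)


# Hypothetical 2x2 gray-level image and a binary mask with labels 0 and 255.
original = [[100, 150], [200, 250]]
mask = [[0, 255], [255, 255]]
print(mse(original, mask))   # 6018.75
print(psnr(original, mask))  # PSNR of the mask against the original, in dB
```

Note that the result depends directly on the label values chosen for the mask, since they enter the squared differences.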

3 Objective

The purpose of this paper is to evaluate the PSNR itself as a reliable analytic method for evaluation of image segmentation algorithms.

4 Methodology

Since we are evaluating not an algorithm but the metric itself, we cannot rely on an existing study that used the PSNR as an analytic method for evaluation. Instead, we propose adopting ground truth data, which would normally be used by empirical methods, as the results of a segmentation algorithm. Then, we use the PSNR as an analytic method to evaluate such results.

For the experiments, we used the set of images from the Berkeley BSR300 Database MartinFTM01 . It comprises 300 images containing several types of scenes, where every image has a corresponding ground truth image. The ground truth is an image containing the contours of the objects in each scene that volunteers defined as the most relevant. Fig. 4 shows an example of an image from the database (a) and its respective ground truth image (b).

Figure 4: Example of an image from the database (a) and its respective ground truth (b)

From each ground truth image, a region mask is obtained, separating the background from the foreground. The mask was obtained by automatically filling the closed contours with white (255), thus creating masks with the most relevant regions of interest. After applying a threshold algorithm to this mask, a binary mask is obtained. Since computer vision techniques are strongly inspired by human vision, we can assume that such binary masks are close to the output of an ideal segmentation algorithm. Fig. 5 shows an example of a filled ground truth (a) and the corresponding binary mask (b) after thresholding.

Figure 5: Automatically filled ground truth image (a) and obtained binary mask (b)

To verify the efficacy of the PSNR as an analytic method for image segmentation, we generated poorly segmented masks from the binary masks using salt and pepper noise. As salt and pepper noise randomly changes pixels to either 0 or 255, we can use it to simulate a bad segmentation. The resulting mask therefore contains several pixels that are incorrectly classified as foreground (255) and background (0). Fig. 6 shows an example of a binary mask (a) and its corresponding bad segmentation (b).
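The degradation step can be sketched as below. This is an illustrative assumption rather than the procedure used in the experiments: the text does not state the noise density or the implementation, so the `density` parameter, the `seed` parameter, and the function name `salt_and_pepper` are hypothetical.

```python
import random


def salt_and_pepper(mask, density, seed=None, low=0, high=255):
    """Return a noisy copy of a binary mask (list of rows).

    Each pixel is, with probability `density`, overwritten by `low` or
    `high` chosen with equal probability; other pixels are kept. The
    density is a hypothetical parameter, not a value from the study.
    """
    rng = random.Random(seed)
    noisy = []
    for row in mask:
        new_row = []
        for p in row:
            if rng.random() < density:
                new_row.append(rng.choice((low, high)))
            else:
                new_row.append(p)
        noisy.append(new_row)
    return noisy


# Hypothetical binary mask, degraded with an arbitrary 30% noise density.
clean = [[0, 255, 0], [255, 255, 0]]
bad = salt_and_pepper(clean, density=0.3, seed=42)
print(bad)
```

Fixing the seed makes the degraded mask reproducible, which is convenient when the same bad mask has to be compared against the original image more than once.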

When used as an analytic method, the PSNR is computed between the resulting image and the original. Therefore, the PSNR must be calculated between each original image and both its corresponding segmentation mask and its bad segmentation mask.

For each image in the database, the PSNR is calculated between the original image and the binary mask, and between the original image and the bad segmentation mask. The results are stored for later analysis.

Figure 6: Binary mask (a) and bad segmentation mask after salt and pepper noise (b)

4.1 Proof

Let the good segmentation set be the set of PSNR results calculated between each binary mask and its corresponding source image. Let the bad segmentation set be the set of PSNR results calculated between each bad segmentation mask and its corresponding source image. If the PSNR is not an adequate analytic method, the average of the PSNR values in the bad segmentation set should be significantly superior to the average obtained in the good segmentation set. For this paper, this condition is adopted as our main hypothesis.

5 Results and discussion

To confirm the main hypothesis, we initially proposed the use of Student's t-test student1908probable , at a fixed significance level, between the PSNR results for the good and bad segmentation masks. However, this test requires the variances of the samples to be homogeneous. We therefore first used Fisher's F test for variance fisher1941asymptotic to verify such homogeneity between the two sets. Figs. 7 and 8 show the probability density of the PSNR results for the good and bad segmentation masks, respectively. If the results from the F test indicate that the variances of the two sets are not homogeneous, the Student's t-test cannot be applied. In this case, Welch's t-test should be used instead welch1947generalization . These hypothesis tests were performed using the R language.

Figure 7: Probability density for the set of PSNR results for good segmentation masks
Figure 8: Probability density for the set of PSNR results for bad segmentation masks

5.1 Fisher’s F test for variance

As the null hypothesis for the F test, we adopt that the variances of the sets are homogeneous. As the alternative hypothesis, we adopt that the variances of the sets are not homogeneous. The results from the F test are shown in Table 1.

Statistic              Value
F                      0.4618
df (numerator)         299
df (denominator)       299
p-value
Confidence interval    0.3679506 to 0.5795227
Ratio of variances
Table 1: Results of the F test of variance between the PSNR sets for good and bad segmentation masks

The p-value of the F test lies in the region of acceptance of the alternative hypothesis. Therefore, it is not safe to assume that the variances of the PSNR sets for good and bad segmentation masks are homogeneous, and the Student's t-test cannot be used reliably. Welch's t-test is then used to determine whether the difference between the two sets is statistically significant.
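For illustration, the F statistic of a variance test is the ratio of the two sample variances, with one degree of freedom fewer than the size of each sample. The sketch below computes only the statistic and the degrees of freedom; the function name `f_statistic` and the sample values are assumptions, and obtaining a p-value requires the F distribution, which is what a statistical package such as R provides.

```python
from statistics import variance


def f_statistic(sample_x, sample_y):
    """Return (F, df_numerator, df_denominator) for a variance-ratio test.

    F is the ratio of the unbiased sample variances of x and y.
    """
    f = variance(sample_x) / variance(sample_y)
    return f, len(sample_x) - 1, len(sample_y) - 1


# Hypothetical PSNR samples, not values from the study.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
print(f_statistic(x, y))  # (0.25, 4, 4)
```

With 300 images in each set, both degrees of freedom equal 299, which matches the values reported for the test.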

5.2 Welch’s T test

As the null hypothesis, we adopt that the means of the good and bad segmentation sets are equal, that is, the difference between the means of both sets is zero. As the alternative hypothesis, we adopt that the mean of the bad segmentation set is superior to the mean of the good segmentation set. Should the alternative hypothesis be accepted, it would suggest that the bad segmentation masks were considered better than the ideal segmentation according to the PSNR metric.

Welch's t-test is then applied at the chosen significance level between both sets. Table 2 shows the results of Welch's t-test.

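Statistic                              Value
t statistic                            -7.6524
df                                     526.607
p-value
Confidence interval
Mean of the good segmentation set
Mean of the bad segmentation set
Table 2: Results of Welch's t-test between the PSNR sets for good and bad segmentation masks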

The p-value of Welch's t-test falls in the region of rejection of the null hypothesis. We are left with the acceptance of the alternative hypothesis, which indicates that the PSNR values calculated from the bad segmentation masks are superior to the ones calculated from the human-derived masks.
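As a rough sketch of what the test computes, Welch's t statistic divides the difference of the sample means by the combined standard error, and the degrees of freedom follow the Welch-Satterthwaite approximation. The function name `welch_t` and the sample values are assumptions for illustration; the p-value would come from the t distribution, as provided by a statistical package such as R.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_x, sample_y):
    """Return (t, df) for Welch's unequal-variance t-test."""
    # Squared standard errors of the two sample means.
    vx = variance(sample_x) / len(sample_x)
    vy = variance(sample_y) / len(sample_y)
    t = (mean(sample_x) - mean(sample_y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (vx + vy) ** 2 / (
        vx ** 2 / (len(sample_x) - 1) + vy ** 2 / (len(sample_y) - 1)
    )
    return t, df


# Hypothetical samples: x plays the role of the good set, y of the bad set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
t, df = welch_t(x, y)
print(t, df)
```

A negative t statistic here means that the first sample has the lower mean. The non-integer degrees of freedom are a characteristic of the Welch-Satterthwaite approximation.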

6 Final considerations

We investigated the efficacy of the PSNR as an analytic method for segmentation algorithms, applying it in the same way it has been adopted in previous work. We used human-created segmentation masks as an ideal reference for a segmentation algorithm and compared the PSNR values calculated from these masks to those calculated from artificially inferior segmentation masks.

To verify whether the PSNR is a good evaluation method, we compared two sets of PSNR values calculated from good and bad segmentation masks. The mask generation procedure can produce masks that would not be obtainable from threshold algorithms, as the values for labels are usually determined by the values of the calculated thresholds. For example, a foreground object on a brighter background would have its pixels set to 255 in the binary mask while the background would be set to 0. However, there is no rule for what levels each label should be set to, and this could influence the PSNR as well. Some graph-based algorithms even separate regions using random colors huang2012robust . Results from such algorithms could not be verified with the PSNR as it is, since they would change greatly from one execution to another.

We proposed the use of Welch's t-test to verify whether the difference between the sets of PSNR values from good and bad segmentation is significant. Higher PSNR values for good segmentation masks would suggest that the PSNR is in fact a good analytic method. However, the results from Welch's t-test suggest exactly the opposite. The PSNR values for the bad segmentation masks are significantly superior to the ones for good segmentation masks. Therefore, the PSNR should not be considered an adequate method for the evaluation of segmentation algorithms. Nevertheless, the PSNR is still a good method to evaluate discrepancies between images and could be used to evaluate edge detection algorithms by comparing their results with ground truth images such as the ones present in the BSR300 database.

Future work could include the verification of multi-threshold algorithms and the determination of the number of thresholds, as well as the impact of the label values.

7 Acknowledgment

The authors would like to thank the University of California, Berkeley for the creation and availability of the BSR300 database.


  • (1) Siddharth Arora, Jayadev Acharya, Amit Verma, and Prasanta K Panigrahi. Multilevel thresholding for image segmentation through a fast statistical recursive algorithm. Pattern Recognition Letters, 29(2):119–125, 2008.
  • (2) Jaime S Cardoso and Luís Corte-Real. Toward a generic evaluation of image segmentation. Image Processing, IEEE Transactions on, 14(11):1773–1782, 2005.
  • (3) Yu-Kumg Chen, Fan-Chieh Cheng, and Pohsiang Tsai. A gray-level clustering reduction algorithm with the least PSNR. Expert Systems with Applications, 38(8):10183–10187, 2011.
  • (4) H. Erdmann, G. Wachs-Lopes, C. Gallão, P. M. Ribeiro, and S. P. Rodrigues. Developments in Medical Image Processing and Computational Vision, chapter A Study of a Firefly Meta-Heuristics for Multithreshold Image Segmentation, pages 279–295. Springer International Publishing, Cham, 2015.
  • (5) Ronald Aylmer Fisher. The asymptotic approach to Behrens's integral, with further tables for the d test of significance. Annals of Eugenics, 11(1):141–172, 1941.
  • (6) Rafael C Gonzalez and Richard E Woods. Digital image processing, 2002.
  • (7) Ming-Huwi Horng and Ren-Jean Liou. Multilevel minimum cross entropy threshold selection based on the firefly algorithm. Expert Systems with Applications, 38(12):14805–14811, 2011.
  • (8) Qing-Hua Huang, Su-Ying Lee, Long-Zhong Liu, Min-Hua Lu, Lian-Wen Jin, and An-Hua Li. A robust graph-based segmentation method for breast tumors in ultrasound images. Ultrasonics, 52(2):266–275, 2012.
  • (9) D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int’l Conf. Computer Vision, volume 2, pages 416–423, July 2001.
  • (10) Paulo S. Rodrigues and Gilson A. Giraldi. Improving the non-extensive medical image segmentation based on Tsallis entropy. Pattern Analysis and Applications, 14(4):369–379, 2011.
  • (11) Student. The probable error of a mean. Biometrika, pages 1–25, 1908.
  • (12) Bernard L Welch. The generalization of 'Student's' problem when several different population variances are involved. Biometrika, pages 28–35, 1947.
  • (13) Cao Yun-Fei, Xiao Yong-Hao, Yu Wei-Yu, and Chen Yong-Chang. Multi-level threshold image segmentation based on PSNR using artificial bee colony algorithm. China Research Journal of Applied Sciences, Engineering and Technology. Published: January 15, 2011.