1. Introduction
The ColorChecker dataset was introduced by Gehler et al. in 2008 [1]. It contains 568 images of everyday scenes and ordinary tourist shots (see Figure 1), mainly taken in Cambridge with two popular cameras, the Canon 1D and the Canon 5D. The ColorChecker dataset is probably the most widely used dataset for evaluating the performance of illuminant estimation algorithms.
The goal of most illuminant estimation algorithms is to infer the chromaticity of the light. The reader will notice that each image in Figure 1 has a Macbeth ColorChecker placed in the scene. The RGB from the brightest non-saturated achromatic patch (see the last row of the color chart) is the correct answer – or groundtruth – for illuminant estimation, according to the calculation methodology described by Shi and Funt [2]. Indeed, most recent work in illuminant estimation uses a linear version of the dataset, with linear images reprocessed by Shi and Funt [2] from Gehler's original raw images [1].
Naturally, the performance of illuminant estimation algorithms – which adopt a range of strategies to infer the RGB of the illuminant – is determined by how well they predict the groundtruth illuminant color. Because we cannot distinguish between a bright scene dimly lit and the converse (the intensity of the illuminant is not recovered), the measure used to quantify the accuracy of an estimate is the angular error. Given a set of angular errors, various statistics are used to summarise performance, including the mean, median and 95% quantile angular error. Given a set of algorithms and their summary performance statistics, it is natural to rank all the algorithms according to the statistic and then conclude that algorithm A is better than B, which is better than C (e.g. if their respective means are in ascending order).
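The recovery angular error and the summary statistics described above can be computed directly; the following is a minimal sketch (the function name and the toy illuminant values are ours, not from the dataset):

```python
import numpy as np

def recovery_angular_error(estimate, groundtruth):
    """Angle in degrees between the estimated and groundtruth illuminant
    RGB vectors; insensitive to the overall intensity of either vector."""
    e = np.asarray(estimate, dtype=float)
    g = np.asarray(groundtruth, dtype=float)
    cos = np.dot(e, g) / (np.linalg.norm(e) * np.linalg.norm(g))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # clip guards rounding

# Toy per-image errors for a hypothetical algorithm; in practice there are
# 568 errors, one per ColorChecker image.
est = [[0.60, 0.55, 0.50], [0.40, 0.50, 0.60]]
gt = [[0.58, 0.56, 0.50], [0.45, 0.50, 0.55]]
errors = np.array([recovery_angular_error(e, g) for e, g in zip(est, gt)])
print(errors.mean(), np.median(errors), np.quantile(errors, 0.95))
```

Because only the direction of the vectors matters, scaling an estimate by any positive constant leaves the error unchanged.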
While there are many image sets with respect to which algorithms might be evaluated and ranked, the ColorChecker dataset is the most widely used. Indeed, perhaps the most comprehensive survey of algorithm performance was carried out by Gijsenij et al. [3]. The results reported there are often quoted in more recent papers, and the paper has now been cited 400 times (as of April 13, 2018). The algorithms considered in this survey also form the basis of the color constancy evaluation site (colorconstancy.com).
Of course, this evaluation process makes perfect sense in principle. In practice, we observed in [4] a surprising flaw in the methodology. Namely, we discovered that there are 3 different sets of groundtruths: one groundtruth – SFU – on www.cs.sfu.ca/colour/data/shi_gehler/ [2] and two groundtruths – Gt1 and Gt2 – on colorconstancy.com. Unfortunately, these 3 groundtruths are used in a haphazard way. Specifically, an author will adopt one of the three groundtruths, evaluate their algorithm and, say, calculate the median angular error. This median is then compared against other algorithms' median errors. But the medians for these competitor algorithms have been calculated with respect to all 3 sets of groundtruth correct answers. This makes no sense, especially since [4] demonstrated that using different sets of groundtruths drastically affects the ranking of algorithms. At the time of writing, there is no definitive ranking of illuminant estimation algorithms for the ColorChecker dataset.
Here, we seek to explain why we find ourselves in this multiple groundtruth world. We then go on to make a new ‘recommended’ groundtruth (REC) for the community. Then, we present the results evaluating algorithms using this recommended groundtruth and compare our results to the benchmark evaluation and rankings. The rankings relative to the new recommended groundtruth reveal for the first time the actual pecking order in illuminant estimation for the ColorChecker dataset.
In Section 2, we discuss how we compute the new REC groundtruth and explain why it differs from the 3 groundtruths already in the literature. In Section 3, we present an analysis of the relative performance of different algorithms using REC compared to the legacy data. The paper ends with a short conclusion.
2. Derivation of the RECommended Groundtruth
We directly adopt the methodology introduced by Shi and Funt [2] to reprocess the raw images and recalculate the groundtruth set of illuminants for the ColorChecker dataset. However, we have used our own code and, in the interests of transparency, we will make our code accessible online [5].
Readers who ‘click through’ to this code and the data will find some additional information. Specifically, we explain that our data repository is ‘open’. We set forth a mechanism for the community to provide further suggested modifications to this groundtruth dataset (if necessary). In the short term, we will monitor any suggestions that the community makes and incorporate these suggestions – if they are significant – into the data. This would lead to a new recommended groundtruth. This said, we have been quite careful in our calculations so it is our hope that our new REC groundtruth will stand the test of time. If a modification is suggested, then both the current REC and the updated version will be retained and labelled with clear time stamps.
Let us outline the REC groundtruth calculation. We reprocessed Gehler's raw images with dcraw [6] to obtain linear demosaiced images. Every image has a ColorChecker in the scene, and this color chart provides the reference for measuring the groundtruth illuminant. In each image, the ColorChecker chart is selected and the median RGB from the brightest achromatic patch – ranked by the average of the selected squares, with no digital count above 3300 (each image is in 12 bits) – defines the groundtruth illuminant. The main steps of the color chart processing are presented in Figure 3.
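The patch-selection step above can be sketched as follows. This is a simplified illustration assuming the achromatic patches have already been located and cropped; the function name, the synthetic patch data and the exact form of the saturation test are ours, reflecting our reading of the Shi-Funt methodology rather than the released code:

```python
import numpy as np

SATURATION = 3300  # digital counts at or above this (in a 12-bit range) are treated as saturated

def groundtruth_from_achromatic_patches(patches):
    """patches: list of HxWx3 arrays, the achromatic patches of the chart.
    Returns the median RGB of the brightest patch containing no saturated
    pixels, so that R, G and B are all drawn from the same patch."""
    candidates = [p for p in patches if p.max() < SATURATION]
    brightest = max(candidates, key=lambda p: p.mean())
    return np.median(brightest.reshape(-1, 3), axis=0)

# Two synthetic patches: the 'white' patch is clipped, so the next
# brightest usable patch supplies the groundtruth.
white = np.full((4, 4, 3), 4095.0)   # saturated, excluded
neutral = np.full((4, 4, 3), 2000.0)
neutral[..., 0] = 2200.0             # slightly reddish illuminant
print(groundtruth_from_achromatic_patches([white, neutral]))  # -> [2200. 2000. 2000.]
```

Taking all three channels from one patch is exactly the property whose violation, in three SFU images, is discussed in the next paragraph.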
One point to highlight is the importance of using the same patch or patches when defining the R, G and B of the light color. We found that this property was not properly enforced in the calculation of the SFU groundtruth: in 3 images, saturated color channels/patches were not correctly identified, with the result that the groundtruth R, G and B values were not taken from the same patch.
An important detail about the ColorChecker dataset is the 'black level'. This is zero for Canon 1D images but 129 for the Canon 5D. One important contribution of the Shi-Funt calculation methodology is the subtraction of the camera black level from the groundtruth. The offsets were estimated from the minimum pixel values across the whole dataset [2].
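The black-level handling amounts to a per-camera offset subtraction; a minimal sketch using the offsets stated above (the dictionary, function name and example values are ours):

```python
import numpy as np

BLACK_LEVEL = {"Canon 1D": 0, "Canon 5D": 129}  # offsets reported in [2]

def subtract_black_level(image, camera):
    """Subtract the per-camera black level and clip negatives to zero."""
    return np.clip(image.astype(float) - BLACK_LEVEL[camera], 0, None)

raw = np.array([[[200.0, 150.0, 140.0]]])
print(subtract_black_level(raw, "Canon 5D"))  # -> [[[71. 21. 11.]]]
print(subtract_black_level(raw, "Canon 1D"))  # unchanged for the Canon 1D
```

Forgetting this step on Canon 5D images shifts every groundtruth RGB by a constant 129 counts, which is precisely the difference between SFU and Gt1 discussed next.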
Let us discuss the groundtruths. The groundtruth set we call SFU comprises the 568 estimates of the RGB light colors and currently appears on the SFU site. On colorconstancy.com, there are two additional groundtruths, Gt1 and Gt2. Gt1 was calculated and put online by Gijsenij in 2011. It refers to the same data as SFU but without the black level being subtracted: Gt1 was calculated from the color chart patch RGB data available on the SFU site [2], without subtracting the offset. A little investigation on our part led us to the following explanation. On the current site [2], it is stated that the black level needs to be subtracted before using the data: "Note that for most applications . . . the black level offset will still need to be subtracted from the original images". When looking at the same webpage using the waybackmachine.org site – which allows recovering webpages from many years ago – the previous statement is not present, and instead we can read "[the processing] also takes into account that the Canon 5D has a black level of 129, which we subtract." We spoke to a number of researchers in the field and asked them to read the current and past webpages. Everyone agreed that, on reading the past instructions, they could easily have made the same 'mistake' (and not subtracted the black level). Gijsenij believes the webpage instructions are a plausible explanation of why he did not subtract the black point.
Researchers who have been in the field a long time have been using Gt1, while more recent workers use SFU. All of these researchers may believe that they are using the same groundtruth. But, in fact, the data are very different.
The third legacy groundtruth dataset we call Gt2. It is very similar to Gt1 (and so, also, different from SFU). The difference between Gt1 and Gt2 is explained by Bianco [22]: "we noticed that for some images the Macbeth ColorChecker coordinates (both the bounding box and the corners of each patch) were wrong and thus the illuminant groundtruth was wrong." In Figure 2, for a subset of the 568 images, we compare the chromaticity distributions of our new REC groundtruth, SFU and Gt1 (Gt2 is not shown as it is almost the same as Gt1). The reader will notice that the current SFU groundtruth is close to our newly calculated RECommended groundtruth except for a few points. These are set apart from the rest of the data, i.e. they appear to be outliers in some sense (note, in Figure 2, the 3 green circles to the left of the plot that do not overlap with the black crosses). These are data points in SFU that differ from REC. We posit that the outliers are due to problems in correctly calculating the bounding boxes (Bianco's observation) and to our own discovery that the white point was on occasion drawn from different achromatic patches (for R versus G versus B). Despite this, the preponderance of the data is in more or less precise alignment. In contrast, the points in Gt1 are far from REC.
3. Evaluating the Performance of Illuminant Estimation Algorithms
In what follows, we provide a performance evaluation and ranking of 23 illuminant estimation algorithms for the RECommended and the legacy SFU and Gt1 groundtruths. In Table 1, the 23 algorithms are ranked according to the median recovery angular error (i.e. the conventional error measure, which gives the angle between the estimated illuminant RGB vector and the groundtruth illuminant RGB). The ranking is the same for REC and SFU, although the median errors are slightly different. This is expected since (recall Figure 2) the number of differences between REC and SFU is small (they are due to errors in the calculation). However, the ranking of algorithms according to the Gt1 dataset is markedly different (due to a different methodology of calculation). The five best algorithms with Gt1 are not in the top 5 according to REC. Vice versa, the 5 best algorithms according to REC are among the worst-performing algorithms with Gt1. The algorithms Edge-based Gamut Mapping [9], 2nd order Grey-Edge [8] and Bayesian [1][10] are, for example, in reverse order. Fast Fourier Color Constancy [7] – which is in significant part built on top of a machine learning algorithm – is best according to the RECommended groundtruth but only 15th on Gt1. This may not be surprising as this algorithm was trained on SFU. Deep Color Constancy using CNNs [11] is 6th based on REC and top ranked based on Gt1. Again, this is not surprising since this algorithm was trained on Gt2 (which, we recall, is very similar to Gt1 but very different from SFU and REC).
In Table 2, we consider the ranking of the 23 illuminant estimation algorithms in terms of the median reproduction angular error [23]. Reproduction error is an angle-type error that evaluates how well a white surface is reproduced. Since image reproduction is the goal of most illuminant estimation algorithms, reproduction error provides a more useful measure of algorithm performance. This time, the results with REC and SFU differ locally. Once again, the results with Gt1 are significantly different. Finally, notice that, for the same groundtruth, the recovery and reproduction angular errors result in different rankings.
4. Conclusion
Illuminant estimation algorithms have been evaluated and compared on the benchmark ColorChecker dataset with at least three different groundtruths, one of which is very different from the other two. In addition, we found that all three of these groundtruths were calculated with small errors. The problem of multiple groundtruths and calculation errors has led to misleading results in the performance evaluation of illuminant estimation algorithms.
In this paper we have introduced a new RECommended groundtruth for this dataset, which we hope rehabilitates the ColorChecker dataset. Broadly, we followed the methodology set forth by Shi and Funt, but used our own code and corrected a few errors (e.g. those reported in [22]). We reevaluated all the algorithms on the widely used comparison site for illuminant estimation algorithms, colorconstancy.com. We invite the community to refer to what we hope is a more definitive comparison in future research.
5. Acknowledgments
This research was supported by EPSRC Grant M001768 and Apple Inc.
References
[1] P. Gehler, C. Rother, A. Blake, T. Minka, and T. Sharp, "Bayesian Color Constancy Revisited," in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1–8, 2008.
[2] L. Shi and B. Funt, "Reprocessed Version of the Gehler Color Constancy Dataset of 568 Images." www.cs.sfu.ca/colour/data/shi_gehler/, [accessed March 2018].
[3] A. Gijsenij, T. Gevers, and J. Van De Weijer, "Computational Color Constancy: Survey and Experiments," IEEE Trans. Image Process., vol. 20, no. 9, pp. 2475–2489, 2011.
[4] G. D. Finlayson, G. Hemrit, A. Gijsenij, and P. Gehler, "A Curious Problem with using the Colour Checker Dataset for Illuminant Estimation," in Color Imaging Conf., pp. 64–69, 2017.
[5] A. Gijsenij and T. Gevers, "Datasets and Results per Datasets." www.colorconstancy.com, [first updated March 2018].
[6] D. Coffin, "Decoding raw digital photos in Linux, download Dcraw 9.27." www.cybercom.net/dcoffin/dcraw/, [accessed January 2018].
[7] J. T. Barron and Y.-T. Tsai, "Fast Fourier Color Constancy," in IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[8] J. Van De Weijer, T. Gevers, and A. Gijsenij, "Edge-Based Color Constancy," IEEE Trans. Image Process., vol. 16, no. 9, pp. 2207–2214, 2007.
[9] A. Gijsenij, T. Gevers, and J. Van De Weijer, "Generalized Gamut Mapping using Image Derivative Structures for Color Constancy," Int. J. Comput. Vis., vol. 86, no. 2, pp. 127–139, 2010.
[10] C. Rosenberg, T. Minka, and A. Ladsariya, "Bayesian Color Constancy with Non-Gaussian Models," Adv. Neural Inf. Process. Syst., MIT Press, vol. 16, 2003.
[11] S. Bianco, C. Cusano, and R. Schettini, "Color Constancy using CNNs," in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 81–89, 2015.
[12] M. S. Drew and H. R. V. Joze, "Exemplar-Based Colour Constancy," in Br. Mach. Vis. Conf., vol. 26, pp. 1–12, 2012.
[13] J. Van De Weijer, C. Schmid, and J. Verbeek, "Using High-Level Visual Information for Color Constancy," in IEEE Int. Conf. Comput. Vis., pp. 1–8, 2007.
[14] A. Gijsenij and T. Gevers, "Color Constancy using Natural Image Statistics and Scene Semantics," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 4, pp. 687–698, 2011.
[15] A. Chakrabarti, K. Hirakawa, and T. Zickler, "Color Constancy with Spatio-Spectral Statistics," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1509–1519, 2012.
[16] S. Bianco, G. Ciocca, C. Cusano, and R. Schettini, "Automatic Color Constancy Algorithm Selection and Combination," Pattern Recognit., vol. 43, no. 3, pp. 695–705, 2010.
[17] E. H. Land and J. J. McCann, "Lightness and Retinex Theory," J. Opt. Soc. Am., vol. 61, no. 1, pp. 1–11, 1971.
[18] G. D. Finlayson and E. Trezzi, "Shades of Gray and Colour Constancy," in Color Imaging Conf., no. 1, pp. 37–41, 2004.
[19] B. Funt and W. Xiong, "Estimating Illumination Chromaticity via Support Vector Regression," in Color Imaging Conf., pp. 47–52, 2004.
[20] G. Buchsbaum, "A Spatial Processor Model for Object Colour Perception," J. Franklin Inst., vol. 310, no. 1, pp. 1–26, 1980.
[21] R. Tan, K. Nishino, and K. Ikeuchi, "Illumination Chromaticity Estimation using Inverse-Intensity Chromaticity Space," in IEEE Conf. Comput. Vis. Pattern Recognit., pp. 673–680, 2003.
[22] S. Bianco, "Personal Communication." [January 12, 2018].
[23] G. D. Finlayson, R. Zakizadeh, and A. Gijsenij, "The Reproduction Angular Error for Evaluating the Performance of Illuminant Estimation Algorithms," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 7, pp. 1482–1488, 2016.