1 Introduction
Color constancy is an essential part of digital image processing pipelines. Treated as a computational process, it involves estimating the color of the scene light source present at capture time, and correcting the image such that its appearance matches that of the scene captured under an achromatic light source. The algorithmic process of recovering the illuminant of a scene is commonly known as computational Color Constancy (CC) or Automatic White Balance (AWB). Accurate estimation is essential for visual aesthetics [24], as well as for downstream high-level computer vision tasks [1, 4, 13, 17] that typically require color-unbiased and device-independent images.

Under the prevalent assumption that the scene is illuminated by a single or dominant light source, the observed pixels of an image are typically modelled using the physical model of Lambertian image formation captured by a trichromatic photosensor:
ρ_c(x) = ∫_Ω E(λ) S(x, λ) C_c(λ) dλ   (1)

where ρ_c(x) is the intensity of color channel c ∈ {R, G, B} at pixel location x, λ is the wavelength of light such that E(λ) represents the spectrum of the illuminant, S(x, λ) is the surface reflectance at pixel location x, and C_c(λ) is the camera sensitivity function for channel c, considered over the spectrum of wavelengths Ω.
The goal of computational CC then becomes estimation of the global illumination color ℓ = (ℓ_R, ℓ_G, ℓ_B), where:

ℓ_c = ∫_Ω E(λ) C_c(λ) dλ   (2)
Finding ℓ in Eq. 2 is an ill-posed problem: infinitely many combinations of illuminant and surface reflectance result in identical observations ρ_c(x) at each pixel x.
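As a toy illustration of this ambiguity (illustrative numbers, not from the paper), the following sketch shows two distinct illuminant/reflectance pairs that produce identical observations under the element-wise rendering model:

```python
import numpy as np

# Two different (illuminant, reflectance) explanations of the same pixel.
# Under the diagonal rendering model, an observation is the element-wise
# product reflectance * illuminant, so scaling the light and inversely
# scaling the surface leaves the observed RGB unchanged.
illum_warm = np.array([1.0, 0.8, 0.6])        # a warm light
refl_a = np.array([0.3, 0.5, 0.7])            # a bluish surface

illum_cool = np.array([0.6, 0.8, 1.0])        # a cool light
refl_b = refl_a * illum_warm / illum_cool     # compensating surface

obs_a = refl_a * illum_warm
obs_b = refl_b * illum_cool
assert np.allclose(obs_a, obs_b)              # identical observations
```

Both explanations fit the data perfectly, which is why a single observed image cannot disambiguate the illuminant without additional priors.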
A natural and popular solution for learning-based color constancy is to frame the problem as a regression task [2, 28, 25, 9, 48, 34, 8]. However, typical regression methods provide a point estimate and offer no information regarding possible alternative solutions. Solution ambiguity is present in many vision domains [45, 36] and is particularly problematic where multi-modal solutions exist [35]. Specifically for color constancy we note that, due to the ill-posed nature of the problem, multiple illuminant solutions are often possible with varying probability. Data-driven approaches that learn to directly estimate the illuminant result in learning tasks that are inherently camera-specific due to the camera sensitivity function, c.f. Eq. 2. This often manifests as a sensor domain gap; models trained on a single device typically exhibit poor generalisation to novel cameras.
In this work, we propose to address the ambiguous nature of the color constancy problem through multiple hypothesis estimation. Using a Bayesian formulation, we discretise the illuminant space and estimate the likelihood that each considered illuminant accurately corrects the observed image. We evaluate how plausible an image is after illuminant correction, and gather a discrete set of plausible solutions in the illuminant space. This strategy can be interpreted as framing color constancy as a classification problem, similar to recent promising work in this direction [7, 6, 38]. Discretisation strategies have also been successfully employed in other computer vision domains, such as 3D pose estimation [35] and object detection [42, 43], resulting in e.g. state-of-the-art accuracy improvements.

In more detail, we propose to decompose the AWB task into three subproblems: a) selection of a set of candidate illuminants, b) learning to estimate the likelihood that an image, corrected by a candidate, is illuminated achromatically, and c) combining candidate illuminants, using the estimated posterior probability distribution, to produce a final output.
We correct an image with all candidates independently and evaluate the likelihood of each solution with a shallow CNN. Our network learns to estimate the likelihood of white balance correctness for a given image. In contrast to prior work, we disentangle camera-specific illuminant estimation from the learning task, thus allowing us to train a single, device-agnostic AWB model that can effectively leverage multi-device data. We avoid the distribution shift and resulting domain gap problems [2, 41, 22] associated with camera-specific training, and propose a well-founded strategy to leverage multiple datasets. Principled combination of datasets is of high value for learning-based color constancy given the typically small size of individual color constancy datasets (on the order of only hundreds of images). See Figure 1.
Our contributions can be summarised as:

We decompose the AWB problem into a novel multi-hypothesis, three-stage pipeline.

We introduce a multi-camera learning strategy that leverages multi-device datasets and improves accuracy over single-camera training.

We provide a training-free model adaptation strategy for new cameras.
2 Related work
Classical color constancy methods utilise low-level statistics to realise various instances of the gray-world assumption: the average reflectance in a scene under a neutral light source is achromatic. Gray-World [12] and its extensions [18, 50] are based on assumptions that tie scene reflectance statistics (e.g. mean, max reflectance) to the achromaticity of scene color.
Related assumptions define perfect reflectance [32, 20] and result in White-Patch methods. Statistical methods are fast and typically contain few free parameters; however, their performance is highly dependent on strong scene content assumptions, and these methods falter where those assumptions fail to hold.
An early Bayesian framework [19] used Bayes' rule to compute the posterior distribution over illuminants and scene surfaces, modelling the priors of the illuminant and the surface reflectance as truncated multivariate normal distributions on the weights of a linear model. Other Bayesian works [44, 23] discretise the illuminant space and model the surface reflectance priors by learning real-world histogram frequencies; in [44] the prior is modelled as a uniform distribution over a subset of illuminants, while [23] uses the empirical distribution of the training illuminants. Our work uses the Bayesian formulation proposed in these previous works [44, 19, 23]. We estimate the likelihood probability distribution with a CNN, which also explicitly learns to model the prior distribution for each illuminant.

Fully supervised methods. Early learning-based works [21, 53, 52] comprise combinational and direct approaches, typically relying on hand-crafted image features that limited their overall performance. Recent fully supervised convolutional color constancy work offers state-of-the-art estimation accuracy. Both local patch-based [8, 48, 9] and full-image input [7, 34, 6, 25, 28] have been considered, investigating different model architectures [8, 9, 48] and the use of semantic information [28, 34, 6].
Some methods frame color constancy as a classification problem, e.g. CCC [7] and its follow-up refinement FFCC [6], by using a color space in which image re-illumination corresponds to a histogram shift. They thus elegantly and efficiently evaluate different illuminant candidates. Our method also discretises the illuminant space, but we explicitly select the candidate illuminants, allowing for multi-camera training, while FFCC [6] is constrained to use all histogram bins as candidates and to single-camera training.
The method of [38] uses K-means [33] to cluster the illuminants of the dataset and then applies a CNN to frame the problem as a classification task; the network input is a single (pre-white-balanced) image and the output is a set of class probabilities, each representing the probability that the corresponding illuminant (class) explains the image illumination. Our method chooses candidate illuminants similarly; however, the key difference is that our model learns to infer whether an image is well white balanced or not. We ask this question K times by correcting the image, independently, with each illuminant candidate. This affords an independent estimate of the likelihood for each illuminant and thus enables multi-device training to improve results.
Multi-device training. The method of [2] introduces a two-CNN approach; the first network learns a 'sensor-independent' linear transformation (a 3×3 matrix), the RGB image is transformed into this 'canonical' color space, and a second network then predicts the illuminant. The method is trained on multiple datasets, excluding the test camera, and obtains competitive results.

The work of [37] affords fast adaptation to previously unseen cameras, and robustness to changes in capture device, by leveraging annotated samples across different cameras and datasets in a meta-learning framework.
A recent approach [10] assumes that sRGB images collected from the web are well white balanced; it therefore applies a simple degamma correction to approximate an inverse tone mapping and then finds achromatic pixels with a CNN to predict the illuminant. These web images were captured with unknown cameras, processed by different ISP pipelines, and may have been modified with image editing software. Despite these additional assumptions, the method achieves promising results, although not comparable with the supervised state-of-the-art.
In contrast, we propose an alternative technique to enable multi-camera training and mitigate well-understood sensor domain gaps. We can train a single CNN using images captured by different cameras through the use of camera-dependent illuminant candidates. This property of accounting for camera-dependent illuminants affords fast model adaptation; accurate inference is achievable for images captured by cameras not seen during training, provided camera illuminant candidates are available (removing the need for model retraining or fine-tuning). We provide further methodological detail of these contributions and evidence towards their efficacy in Sections 3 and 4, respectively.
3 Method
Let ρ(x) be a pixel from an input image I in linear RGB space. We model the global illumination, Eq. 2, with the standard linear model [51], such that each pixel is the element-wise product of the surface reflectance R(x) and a global illuminant ℓ shared by all pixels:

ρ(x) = R(x) ∘ ℓ   (3)

Given I, comprising pixels ρ(x), our goal is to estimate ℓ and produce the corrected image I_ℓ, whose pixels ρ(x) / ℓ (element-wise division) approximate the scene under an achromatic light source.
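A minimal sketch of this correction step, assuming the element-wise (diagonal) model of Eq. 3 and toy values (the function and constants here are illustrative, not the paper's code):

```python
import numpy as np

def correct(image, illuminant):
    """Diagonal (von Kries-style) correction: divide every pixel
    element-wise by the (normalised) estimated illuminant."""
    illuminant = np.asarray(illuminant, dtype=np.float64)
    return image / (illuminant / np.linalg.norm(illuminant))

# A 2x2 toy image of achromatic surfaces lit by a warm illuminant.
illum = np.array([0.8, 0.55, 0.25])
gray_world = np.full((2, 2, 3), 0.5)              # achromatic reflectance
observed = gray_world * illum / np.linalg.norm(illum)   # Eq. 3 rendering
restored = correct(observed, illum)
assert np.allclose(restored, gray_world)          # achromatic appearance recovered
```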
In order to estimate the correct illuminant to adjust the input image I, we propose to frame the CC problem with a probabilistic generative model with unknown surface reflectances and illuminant. We consider a set of K candidate illuminants {ℓ_1, …, ℓ_K}, each of which is applied to I to generate a set of tentatively corrected images {I_{ℓ_1}, …, I_{ℓ_K}}. Using the set of corrected images as inputs, we then train a CNN to identify the most probable illuminants, such that the final estimated illuminant is a linear combination of the candidates. In this section, we first introduce our general Bayesian framework, followed by our proposed implementation of the main building blocks of the model. An overview of the method is shown in Figure 2.
3.1 Bayesian approach to color constancy
Following the Bayesian formulation previously considered [44, 19, 23], we assume that the color of the light and the surface reflectance are independent. Formally, p(ℓ, R) = p(ℓ) p(R), i.e. knowledge of the surface reflectance R provides no additional information about the illuminant: p(ℓ | R) = p(ℓ). Based on this assumption we decompose these factors and model them separately.
Using Bayes' rule, we define the posterior distribution of illuminants ℓ given the input image I as:

p(ℓ | I) = p(I | ℓ) p(ℓ) / p(I)   (4)
We model the likelihood of an observed image I for a given illuminant ℓ:

p(I | ℓ) = ∫ p(I | ℓ, R) p(R) dR   (5)

where R denotes the surface reflectances and I_ℓ is the image as corrected with illuminant ℓ. The term p(I | ℓ, R) is only non-zero for R = I_ℓ, so the likelihood reduces to p(R = I_ℓ): it rates whether a corrected image looks realistic.
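To make Eq. 5 concrete, here is a toy instantiation of our own devising (not the paper's learned model): a Gaussian "near-achromatic" prior over reflectances stands in for p(R), so the likelihood of each candidate is the prior density of the corrected image, and the true illuminant scores highest:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy reflectance prior: surfaces are assumed near-achromatic, so the
# log-density penalises per-pixel deviation from the channel mean.
def log_prior_reflectance(R, sigma=0.1):
    return -np.sum((R - R.mean(axis=-1, keepdims=True)) ** 2) / (2 * sigma**2)

true_illum = np.array([0.9, 0.6, 0.4])
R = 0.5 + 0.05 * rng.standard_normal((64, 3))   # 64 near-gray surfaces
image = R * true_illum                          # Eq. 3 rendering

candidates = [np.array([0.9, 0.6, 0.4]),        # the true light
              np.array([0.4, 0.6, 0.9]),        # a cool light
              np.array([0.6, 0.6, 0.6])]        # a neutral light

# Eq. 5: p(I | l) = p(R = I_l): score each corrected image under the prior.
scores = [log_prior_reflectance(image / l) for l in candidates]
assert int(np.argmax(scores)) == 0              # true illuminant is most likely
```

In the paper this hand-written prior is replaced by a learned CNN evaluating how realistic the corrected image looks.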
We choose to instantiate our likelihood model using a shallow CNN. The network should learn to output a high likelihood if the reflectances look realistic. We model the prior probability p(ℓ_i) for each candidate illuminant independently, as learnable parameters in an end-to-end approach; this effectively acts as regularisation, favouring more likely real-world illuminants. We note that, in practice, the prior also depends on factors such as the environment (indoor/outdoor), the time of day, ISO, etc.; however, the size of currently available datasets prevents us from modelling more complex proxies.

In order to estimate the illuminant ℓ̂, we optimise the quadratic cost (the minimum-MSE Bayesian estimator), which is minimised by the mean of the posterior distribution:
ℓ̂ = E[ℓ | I] = Σ_i ℓ_i p(ℓ_i | I)   (6)
This is done in the following three steps (c.f. Figure 2):

Candidate selection (Section 3.2): choose a set of K illuminant candidates and generate the corresponding corrected thumbnail images.

Likelihood estimation (Section 3.3): evaluate these corrected images independently with a CNN designed to estimate the likelihood that an image is well white balanced.

Illuminant determination (Section 3.4): compute the posterior probability of each candidate illuminant and determine a final illuminant estimate ℓ̂.
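The three steps can be sketched end to end as follows; the `grayness` scorer is a hand-written stand-in for the trained likelihood CNN, and all names and values are illustrative:

```python
import numpy as np

def estimate_illuminant(image, candidates, log_likelihood_fn):
    """Three-step pipeline sketch: correct with every candidate,
    score each corrected image, then combine candidates weighted
    by their (softmax-normalised) posterior probabilities."""
    scores = np.array([log_likelihood_fn(image / l) for l in candidates])
    post = np.exp(scores - scores.max())
    post /= post.sum()                       # posterior over candidates
    est = post @ np.asarray(candidates)      # posterior mean (Eq. 6)
    return est / np.linalg.norm(est)

# Stand-in for the trained CNN: rewards near-achromatic corrected images.
def grayness(img):
    return -10 * np.sum((img - img.mean(axis=-1, keepdims=True)) ** 2)

true_l = np.array([0.8, 0.6, 0.3]); true_l /= np.linalg.norm(true_l)
image = np.full((32, 3), 0.4) * true_l       # gray scene under true_l
cands = np.array([[0.8, 0.6, 0.3], [0.3, 0.6, 0.8], [0.6, 0.6, 0.6]])
cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)

est = estimate_illuminant(image, cands, grayness)
angle = np.degrees(np.arccos(np.clip(est @ true_l, -1.0, 1.0)))
assert angle < 1.0                           # recovers the true illuminant
```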
This formulation allows estimation of a posterior probability distribution, allowing us to reason about a set of probable illuminants rather than a single illuminant point estimate (c.f. regression approaches). Regression typically provides no feedback on possible alternative solutions, which has been shown to be of high value in other vision problems [35].
The second benefit our decomposition affords is a principled multi-camera training process. A single, device-agnostic CNN estimates illuminant likelihoods, while candidate illuminants are selected independently for each camera. By leveraging image information across multiple datasets we increase model robustness. Additionally, the amalgamation of small available CC datasets provides a step towards harnessing the power of large-capacity models for this problem domain, c.f. contemporary models.
3.2 Candidate selection
The goal of candidate selection is to discretise the illuminant space of a specific camera in order to obtain a set of representative illuminants (spanning the illuminant space). Given a collection of ground-truth illuminants, measured from images containing calibration objects (i.e. a labelled training set), we compute candidates using K-means clustering [33] in linear RGB space.
By forming K clusters of the measured illuminants, we define the set of candidates as the K cluster centers. K-means illuminant clustering has previously been shown to be effective for color constancy [38]; we additionally evaluate alternative candidate selection strategies (detailed in the supplementary material), and our experimental investigation confirms that the simple K-means approach provides strong target-task performance. Further, the effect of K is empirically evaluated in Section 4.4.
An image captured by a given camera is then used to produce a set of K corrected images, one per illuminant in that camera's candidate set, on which we evaluate the accuracy of each candidate.
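A minimal K-means candidate selection sketch in NumPy (a hand-rolled Lloyd's iteration with random initialisation, standing in for whatever clustering library is used in practice; the synthetic illuminant clusters are illustrative):

```python
import numpy as np

def kmeans_candidates(illuminants, k, iters=50, seed=0):
    """Cluster ground-truth illuminants (rows, linear RGB) with Lloyd's
    algorithm and return the k cluster centers, normalised to unit length,
    as the camera's candidate set."""
    rng = np.random.default_rng(seed)
    centers = illuminants[rng.choice(len(illuminants), k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(illuminants[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = illuminants[labels == j].mean(axis=0)
    return centers / np.linalg.norm(centers, axis=1, keepdims=True)

# Two synthetic illuminant clusters (e.g. daylight-like vs. tungsten-like).
rng = np.random.default_rng(1)
warm = np.array([0.8, 0.6, 0.3]) + 0.02 * rng.standard_normal((100, 3))
cool = np.array([0.4, 0.6, 0.8]) + 0.02 * rng.standard_normal((100, 3))
cands = kmeans_candidates(np.vstack([warm, cool]), k=2)
assert cands.shape == (2, 3)
```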
3.3 Likelihood estimation
We model the likelihood estimation step using a neural network which, for a given illuminant ℓ_i and image I, takes the tentatively corrected image I_{ℓ_i} as input and learns to predict the likelihood that the image has been well white balanced, i.e. has the appearance of being captured under an achromatic light source.

The success of low-capacity histogram-based methods [7, 6] and the inference-training tradeoff for small datasets motivate a compact network design. We propose a small CNN with one spatial convolution; subsequent layers constitute convolutions with spatial pooling. Lastly, three fully connected layers gradually reduce the dimensionality to one (see supplementary material for architecture details). The network output is then a single value that represents the log-likelihood that the image is well white balanced:
log p(I | ℓ_i) = f(I_{ℓ_i}; θ)   (7)
The function f is our trained CNN, parametrised by model weights θ. Eq. 7 estimates the log-likelihood of each candidate illuminant separately. It is important to note that we train only a single CNN, which is used to estimate the likelihood for each candidate illuminant independently. In practice, however, certain candidate illuminants will be more common than others. To account for this, following [6], we compute an affine transformation of the log-likelihood by introducing learnable, illuminant-specific gain and bias parameters g_i and b_i. The gain affords amplification of illuminant likelihoods, while the bias term learns to prefer some illuminants, i.e. a prior distribution in the Bayesian sense: b_i = log p(ℓ_i). The log-posterior probability can then be formulated as:
log p(ℓ_i | I) ∝ g_i f(I_{ℓ_i}; θ) + b_i   (8)
We highlight that the learned affine transformation parameters depend on the training cameras; we provide further discussion of camera-agnostic considerations in Section 3.5.
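Eq. 8 amounts to a per-candidate affine transform followed by normalisation; a small sketch with illustrative gain, bias, and CNN scores (not learned values):

```python
import numpy as np

def log_posterior(log_lik, gain, bias):
    """Eq. 8: per-candidate affine transform of the CNN log-likelihood;
    the learnable bias plays the role of a log-prior over candidates."""
    return gain * log_lik + bias

log_lik = np.array([-0.2, -1.5, -3.0])     # CNN outputs, one per candidate
gain = np.ones(3)                          # illustrative gains
bias = np.log(np.array([0.5, 0.3, 0.2]))   # prior: candidate 0 most common

s = log_posterior(log_lik, gain, bias)
post = np.exp(s - s.max()); post /= post.sum()   # normalised posterior
assert abs(post.sum() - 1.0) < 1e-12
assert post.argmax() == 0
```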
3.4 Illuminant determination
We require a differentiable method in order to train our model end-to-end, so a simple Maximum a Posteriori (MAP) inference strategy is not possible. To estimate the illuminant ℓ̂, we therefore use the minimum mean squared error Bayesian estimator, which is minimised by the posterior mean (c.f. Eq. 6):

ℓ̂ = Σ_i ℓ_i · exp(g_i f(I_{ℓ_i}; θ) + b_i) / Σ_j exp(g_j f(I_{ℓ_j}; θ) + b_j)   (9)
The resulting vector ℓ̂ is normalised. We leverage our K-means centroid representation of the linear RGB space and use linear interpolation within the convex hull of feasible illuminants to determine the estimated scene illuminant ℓ̂. For Eq. 9, we take inspiration from [29, 38], which successfully explored similar strategies in CC and stereo regression; e.g. [29] introduced an analogous soft-argmin to estimate disparity values from a set of candidates. We apply a similar strategy for illuminant estimation and use the soft-argmax, which provides a linear combination of all candidates weighted by their probabilities.

We train our network end-to-end with the commonly used angular error loss, where ℓ̂ and ℓ* are the predicted and ground-truth illuminants, respectively:

err(ℓ̂, ℓ*) = arccos( (ℓ̂ · ℓ*) / (‖ℓ̂‖ ‖ℓ*‖) )   (10)
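The loss of Eq. 10 can be implemented directly (shown here in NumPy for clarity; a differentiable framework version is analogous):

```python
import numpy as np

def angular_error_deg(pred, gt):
    """Eq. 10: angle in degrees between predicted and ground-truth
    illuminants; invariant to the scale of either vector."""
    pred = pred / np.linalg.norm(pred)
    gt = gt / np.linalg.norm(gt)
    # Clip guards against arccos domain errors from rounding.
    return np.degrees(np.arccos(np.clip(pred @ gt, -1.0, 1.0)))

# Scale invariance: same chromaticity, different intensity -> zero error.
assert angular_error_deg(np.array([1., 1., 1.]), np.array([2., 2., 2.])) < 1e-6
# Orthogonal illuminants -> 90 degrees.
err = angular_error_deg(np.array([1., 0., 0.]), np.array([0., 1., 0.]))
assert abs(err - 90.0) < 1e-6
```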
3.5 Multidevice training
As discussed in previous work [2, 41, 22], CC models typically fail to train successfully on multi-camera data due to distribution shifts between camera sensors, making such models intrinsically device-dependent and limiting usable model capacity. A device-independent model is highly appealing due to the small number of images commonly available in camera-specific public color constancy datasets; the cost and time associated with collecting and labelling large new datasets for specific novel devices is prohibitive.
Our CNN learns to produce the likelihood that an input image is well white balanced. We claim that framing this part of the CC problem in such a fashion results in a device-independent learning task, and we evaluate the benefit of this hypothesis experimentally in Section 4.
To train with multiple cameras we use camera-specific candidates yet learn only a single model. Specifically, we draw each batch from a different camera, use that camera's candidates, and update a single set of CNN parameters during model training. To ensure that our CNN is device-independent, we fix the previously learnable parameters that depend on sensor-specific illuminants, i.e. g_i and b_i. The absence of these camera-dependent parameters intuitively restricts model flexibility; however, we observe this drawback to be compensated by the resulting ability to train using amalgamated multi-camera datasets, i.e. more data. This strategy renders our CNN camera-agnostic and affords the option to refine an existing CNN when data from novel cameras becomes available. We clarify, however, that our overarching white balancing strategy maintains the use of camera-specific candidate illuminants.
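The batching scheme can be sketched as follows; camera names and candidate values are illustrative, and the actual gradient update on the shared CNN weights is elided:

```python
import numpy as np

# Camera-specific candidate sets; a single shared model scores them all.
candidates = {
    "camera_a": np.array([[0.8, 0.6, 0.3], [0.5, 0.6, 0.6]]),
    "camera_b": np.array([[0.7, 0.6, 0.4], [0.4, 0.6, 0.7]]),
}

def training_schedule(cameras, steps):
    """Yield (camera, candidate set) pairs, one camera per batch, so that
    every gradient step updates the same shared CNN weights while the
    candidate set (and hence the corrected inputs) is camera-specific."""
    for step in range(steps):
        cam = cameras[step % len(cameras)]
        yield cam, candidates[cam]
        # ...correct the batch with `candidates[cam]`, score with the
        # shared CNN, and apply one optimiser step (elided here).

cams = list(candidates)
schedule = list(training_schedule(cams, steps=4))
assert [c for c, _ in schedule] == ["camera_a", "camera_b"] * 2
for cam, cand in schedule:
    assert np.array_equal(cand, candidates[cam])
```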
4 Results
4.1 Training details
We train our models with K-means [33] candidate selection and the Adam optimiser [30], dividing the initial learning rate by two at fixed points during training. Dropout [27] of 50% is applied after average pooling. We take the log transform of the input before the first convolution. Efficient inference is feasible by concatenating each candidate-corrected image into the batch dimension. We use PyTorch 1.0 [39] and an Nvidia Tesla V100 for our experiments. The first layer is the only spatial convolution; it is adapted from [49] and pretrained on ImageNet [16]. We fix the weights of this first layer to avoid overfitting. For all experiments, calibration objects are masked, the black level is subtracted, and oversaturated pixels are clipped. We resize and normalise the input images.

4.2 Datasets
We experiment using three public datasets. The Gehler-Shi dataset [47, 23] contains indoor and outdoor scenes captured using Canon 1D and Canon 5D cameras. We note the existence of multiple sets of non-identical ground-truth labels for this dataset (see [26] for further detail); our Gehler-Shi evaluation uses the SFU ground-truth labels [47] (consistent with the label naming convention of [26]). The NUS dataset [14] consists of per-camera subsets of 210 images. The Cube+ dataset [5] contains predominantly outdoor images captured with a Canon 550D camera.
For the NUS [14] and Gehler-Shi [47, 23] datasets we perform three-fold cross validation (CV) using the splits provided in previous work [6, 7]. The Cube+ [5] dataset does not provide CV splits, so we use all images for learning and evaluate using a related set of test images provided for the recent Cube+ ISPA 2019 challenge [31]. We compare with the results from the challenge leaderboard.
For the NUS dataset [14], we additionally explore training multi-camera models, and create a new set of CV folds to facilitate this. We highlight that the NUS dataset consists of eight image subsets, pertaining to eight capture devices. Each of our new folds captures a distinct set of scene content (i.e. sets of up to eight similar images for each captured scene); this avoids testing on scene content similar to that seen during training. We define our multi-camera CV such that each multi-camera fold is the concatenation of images, pertaining to common scenes, captured by all eight cameras. The folds we define are made available in our supplementary material.
4.3 Evaluation metrics
We use the standard angular error metric for quantitative evaluation (c.f. Eq. 10). We report standard CC statistics to summarise results over the investigated datasets: Mean, Median, Trimean, Best 25%, Worst 25%. We further report method inference time in the supplementary material. Results for other works were taken from the corresponding papers, resulting in missing statistics for some methods. Since the NUS [14] dataset comprises eight cameras, we report the geometric mean of each statistic across all cameras for each method, as is standard in the literature [6, 7, 28].

4.4 Quantitative evaluation
Accuracy experiments. We report competitive results on the Gehler-Shi dataset [47, 23] (c.f. Table 1). This dataset can be considered very challenging, as the number of images per camera is highly imbalanced between the Canon 1D and Canon 5D. Our method does not outperform the state-of-the-art, likely due to this imbalance and the small size of the Canon 1D subset. Pretraining on a combination of NUS [14] and Cube+ [5] provides moderate accuracy improvement, despite the fact that the Gehler-Shi illuminant distribution differs significantly from those seen during pretraining. We provide additional experiments exploring the effect of varying K for K-means candidate selection in the supplementary material.
Results for NUS [14] are provided in Table 2. Our method obtains competitive accuracy, and the previously observed trend holds: pretraining using additional datasets (here Gehler-Shi [47, 23] and Cube+ [5]) again improves results.
In Table 3, we report results for our multi-device setting on the NUS [14] dataset. For this experiment we introduce a new set of training folds to ensure that scenes are well separated; we refer to Section 3.5 for multi-device training and Section 4.2 for related training fold detail. We draw a multi-device comparison with FFCC [6] by choosing to center the FFCC histogram using the training set (of amalgamated camera datasets). Note that results are not directly comparable with Table 2 due to our redefinition of CV folds. Our method is more accurate than the state-of-the-art when training considers all available cameras at the same time. Multi-device training improves the median angular error of every individual camera dataset (we provide per-camera results in the supplementary material), as well as overall median accuracy.
We also outperform the state-of-the-art on the recent Cube challenge [31], as shown in Table 4. Pretraining on both Gehler-Shi [47, 23] and NUS [14] improves our Mean and Worst 95% statistics.
In summary, we observe strong generalisation when using multi-camera training (e.g. NUS [14] results, c.f. Tables 2 and 3). These experiments illustrate the large benefit achievable with multi-camera training when the illuminant distributions of the cameras are broadly consistent. Gehler-Shi [47, 23] has a very disparate illuminant distribution with respect to the alternative datasets, so we are likely unable to exploit the full advantage of multi-camera training there. We note that the state-of-the-art FFCC [6] method is extremely shallow and therefore well suited to small datasets. In contrast, when our model is trained on large and relevant datasets, we achieve superior results.
Run time. Our unoptimised PyTorch implementation achieves real-time inference speeds (see supplementary material for detailed timings).
4.5 Training on novel sensors
To explore the camera-agnostic elements of our model, we train on a combination of the full NUS [14] and Gehler-Shi [47, 23] datasets. As described in Section 3.5, the only remaining device-dependent component is per-device illuminant candidate selection. Once the model is trained, we select candidates from Cube+ [5] and test on the Cube challenge dataset [31]. We highlight that neither Cube+ nor Cube challenge imagery is seen during model training. For meaningful evaluation, we compare against both classical and recent learning-based [2] camera-agnostic methods. Results are shown in Table 5. We obtain results comparable to Table 4 without seeing any imagery from our target camera, outperforming both the baselines and [2]. We clarify that our method performs candidate selection using Cube+ [5] to adapt the candidate set to the novel device, while [2] sees no information from the new camera.
We provide additional experimental results for differing values of K (K-means candidate selection) in the supplementary material, where we observe stable performance across a range of K. The low number of candidates required is likely linked to the two Cube datasets having reasonably compact illuminant distributions.
4.6 Qualitative evaluation
We provide visual results for the Gehler-Shi [47, 23] dataset in Figure 21. We sort inference results by increasing angular error and sample images uniformly. For each row, we show (a) the input image, (b) our estimated illuminant color and resulting white-balanced image, and (c) the ground-truth illuminant color and resulting white-balanced image. Images are first white-balanced; we then apply an estimated CCM (Color Correction Matrix) and, finally, sRGB gamma correction. We mask out the Macbeth Color Checker calibration object during both training and evaluation.
Our most challenging example (c.f. last row of Figure 21) is a multi-illuminant scene (indoor and outdoor lights). We observe that our method performs accurate correction for objects illuminated by the outdoor light, yet the ground truth is measured only for the indoor illuminant, hence the high angular error. This highlights a limitation linked to our single global illuminant assumption, common to the majority of CC algorithms. We show additional qualitative results in the supplementary material.
5 Conclusion
We propose a novel multi-hypothesis color constancy model capable of effectively learning from image samples captured by multiple cameras. We frame the problem under a Bayesian formulation and obtain data-driven likelihood estimates by learning to classify achromatic imagery. We highlight the challenging nature of multi-device learning due to differences in camera color spaces, spectral sensitivities and physical sensor effects. We validate the benefits of our proposed solution for multi-device learning and provide state-of-the-art results on two popular color constancy datasets while maintaining real-time inference constraints. We additionally provide evidence supporting our claim that framing the learning question as a classification task, c.f. regression, can lead to strong performance without requiring model retraining or fine-tuning.

References

[1]
(2019)
What else can fool deep learning? addressing color constancy errors on deep neural network performance
. In 2019 IEEE International Conference on Computer Vision, ICCV 2019, Seoul, Korea, October 29November 1, 2019, Cited by: §1.  [2] (2019) SensorIndependent Illumination Estimation for DNN Models. In Proceedings of the British Machine Vision Conference 2019, BMVC 2019, Cardiff University, Cardiff, UK, September 912, 2019, Cited by: §1, §1, §2, §3.5, §4.5, Table 1, Table 2, Table 4, Table 5.

[3]
(2019)
When color constancy goes wrong: correcting improperly whitebalanced images.
In
IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 1620, 2019
, pp. 1535–1544. External Links: Link, Document Cited by: Table 4.  [4] (2012) On sensor bias in experimental methods for comparing interestpoint, saliency, and recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1), pp. 110–126. External Links: Link, Document Cited by: §1.
 [5] (2018) Unsupervised learning for color constancy. In Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2018)  Volume 4: VISAPP, Funchal, Madeira, Portugal, January 2729, 2018, pp. 181–188. External Links: Link, Document Cited by: Table 8, item 4, §4.2, §4.2, §4.4, §4.4, §4.5, Table 5.
 [6] (2017) Fast fourier color constancy. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 2126, 2017, pp. 6950–6958. External Links: Link, Document Cited by: Table 11, Table 12, Appendix E, Figure G.21, Figure G.42, Appendix G, A MultiHypothesis Approach to Color Constancy: supplementary material, Figure 1, §1, §2, §2, §3.3, §3.3, §4.2, §4.3, §4.4, §4.4, Table 1, Table 2, Table 3, Table 4.
 [7] (2015) Convolutional color constancy. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 713, 2015, pp. 379–387. External Links: Link, Document Cited by: Table 11, Appendix G, §1, §2, §2, §3.3, §4.2, §4.3, Table 1, Table 2.
 [8] (2015) Color constancy using cnns. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2015, Boston, MA, USA, June 712, 2015, pp. 81–89. External Links: Link, Document Cited by: §1, §2.

[9]
(2017)
Single and multiple illuminant estimation using convolutional neural networks
. IEEE Transactions on Image Processing 26 (9), pp. 4347–4362. External Links: Link, Document Cited by: §1, §2.  [10] (2019) Quasiunsupervised color constancy. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 1620, 2019, pp. 12212–12221. External Links: Link, Document Cited by: §2, Table 1, Table 2.
[11] (1986) Analysis of the retinex theory of color vision. JOSA A 3 (10), pp. 1651–1661.
[12] (1980) A spatial processor model for object colour perception. Journal of the Franklin Institute 310 (1), pp. 1–26.
[13] (2018) Modeling camera effects to improve deep vision for real and synthetic data. CoRR abs/1803.07721.
[14] (2014) Illuminant estimation for color constancy: why spatial-domain methods work and the role of the color distribution. JOSA A 31 (5), pp. 1049–1058.
[15] (2015) Effective learning-based illuminant estimation using simple features. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 1000–1008.
[16] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2009, Miami, Florida, USA, June 20-25, 2009, pp. 248–255.
[17] (2017) Dirty pixels: optimizing image classification architectures for raw sensor data. CoRR abs/1701.06487.
[18] (2004) Shades of gray and colour constancy. In The Twelfth Color Imaging Conference: Color Science and Engineering Systems, Technologies, Applications, CIC 2004, Scottsdale, Arizona, USA, November 9-12, 2004, pp. 37–41.
[19] (1995) Bayesian decision theory, the maximum local mass estimate, and color constancy. In Proceedings of the Fifth International Conference on Computer Vision, ICCV 1995, Cambridge, Massachusetts, USA, June 20-23, 1995, pp. 210–217.
[20] (2010) The rehabilitation of max-RGB. In 18th Color and Imaging Conference, CIC 2010, San Antonio, Texas, USA, November 8-12, 2010, pp. 256–259.
[21] (2004) Estimating illumination chromaticity via support vector regression. In The Twelfth Color Imaging Conference: Color Science and Engineering Systems, Technologies, Applications, CIC 2004, Scottsdale, Arizona, USA, November 9-12, 2004, pp. 47–52.
[22] (2017) Improving color constancy by discounting the variation of camera spectral sensitivity. JOSA A 34 (8), pp. 1448–1462.
[23] (2008) Bayesian color constancy revisited. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2008, Anchorage, Alaska, USA, June 24-26, 2008.
[24] (2009) Perceptual analysis of distance measures for color constancy algorithms. JOSA A 26 (10), pp. 2243–2256.
[25] (2019) Convolutional mean: a simple convolutional neural network for illuminant estimation. In Proceedings of the British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019.
[26] (2018) Rehabilitating the ColorChecker dataset for illuminant estimation. In Color and Imaging Conference, Vol. 2018, pp. 350–353.
[27] (2012) Improving neural networks by preventing co-adaptation of feature detectors. CoRR abs/1207.0580.
[28] (2017) FC4: fully convolutional color constancy with confidence-weighted pooling. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 330–339.
[29] (2017) End-to-end learning of geometry and context for deep stereo regression. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 66–75.
[30] (2015) Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015.
[31] ISPA 2019 Illumination Estimation Challenge. https://www.isispa.org/illuminationestimationchallenge. Accessed November 14, 2019.
[32] (1971) Lightness and retinex theory. JOSA 61 (1), pp. 1–11.
[33] (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28 (2), pp. 129–136.
[34] (2015) Color constancy by deep learning. In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, pp. 76.1–76.12.
[35] (2018) A mixed classification-regression framework for 3D pose estimation from 2D images. In British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, pp. 72.
[36] (2019) Explaining the ambiguity of object detection and 6D pose from visual data. In IEEE International Conference on Computer Vision, ICCV 2019, Seoul, Korea, October 29 - November 1, 2019.
[37] (2018) Meta-learning for few-shot camera-adaptive color constancy. CoRR abs/1811.11788.
[38] (2017) Approaching the computational color constancy as a classification problem through deep learning. Pattern Recognition 61, pp. 405–416.
[39] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, NeurIPS 2019, Vancouver, BC, Canada, December 8-14, 2019, pp. 8024–8035.
[40] (2019) Fast Fourier color constancy and grayness index for ISPA illumination estimation challenge. In 11th International Symposium on Image and Signal Processing and Analysis, ISPA 2019, Dubrovnik, Croatia, September 23-25, 2019, pp. 352–354.
[41] (2014) Raw-to-raw: mapping between image sensor color responses. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pp. 3398–3405.
[42] (2016) You only look once: unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 779–788.
[43] (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149.
[44] (2003) Bayesian color constancy with non-Gaussian models. In Advances in Neural Information Processing Systems 16, NIPS 2003, Vancouver and Whistler, British Columbia, Canada, December 8-13, 2003, pp. 1595–1602.
[45] (2017) Learning in an uncertain world: representing ambiguity through multiple hypotheses. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 3611–3620.
[46] (2019) Color Cerberus. In 11th International Symposium on Image and Signal Processing and Analysis, ISPA 2019, Dubrovnik, Croatia, September 23-25, 2019, pp. 355–359.
[47] Reprocessed version of the Gehler color constancy dataset. https://www2.cs.sfu.ca/~colour/data/shi_gehler/. Accessed November 14, 2019.
[48] (2016) Deep specialized network for illuminant estimation. In Computer Vision - ECCV 2016, 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pp. 371–387.
[49] (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015.
[50] (2007) Edge-based color constancy. IEEE Transactions on Image Processing 16 (9), pp. 2207–2214.
[51] (1905) Influence of adaptation on the effects produced by luminous stimuli. Handbuch der Physiologie des Menschen 3, pp. 109–282.
[52] (2009) Edge-based color constancy via support vector regression. IEICE Transactions on Information and Systems 92-D (11), pp. 2279–2282.
[53] (2006) Estimating illumination chromaticity via support vector regression. Journal of Imaging Science and Technology 50 (4), pp. 341–348.
A Multi-Hypothesis Approach to Color Constancy: supplementary material
We provide additional material to supplement our main paper. In Appendix A, we present our shallow CNN architecture. Two experimental studies on the number of illuminant candidates are provided in Appendix B. In Appendix C, we report per-camera median angular error on NUS [14] to support our claim that multi-camera training consistently improves accuracy for each camera (see the main paper). In Appendix D, we show additional results from our exploration of candidate selection strategies. Appendix E provides runtime measurements, and in Appendix F we examine failure cases and discuss limitations of our method. Finally, Appendix G provides additional visual results comparing our method with FFCC [6].
Appendix A Architecture details
In Table 6, we present our CNN architecture. We propose a shallow CNN: one spatial convolution layer, two subsequent convolution layers, and a final global spatial average pooling. Lastly, three fully connected layers gradually reduce the dimensionality to one.
Layer      | Kernel | Input | Output
Conv.      |        |       |
Conv.      |        |       |
Conv.      |        |       |
Avg. Pool. |        |       |
FC         |        |       |
FC         |        |       |
FC         |        |       |
Table 6: CNN architecture details. Convolutions and fully connected layers are followed by a ReLU activation, except the last layer.
Appendix B Number of illuminant candidates
In Table 7, we present a study varying the number of candidate illuminants produced by k-means. We find experimentally that accuracy improves with the number of cluster centres until a plateau is reached, suggesting that a sufficient number of candidate illuminants is required to achieve competitive angular error on the Gehler-Shi dataset [47, 23].
Additionally, in Table 8 we provide analogous results for different values of k for k-means candidate selection with the training-free model (see main paper Section 4.5). We observe stable accuracy across a range of k. The low number of candidates required is likely linked to the two Cube datasets having reasonably compact illuminant distributions.
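Generating the candidate set with k-means amounts to clustering the measured training-set illuminants and keeping the centroids. A minimal numpy-only sketch of Lloyd's algorithm [33] follows; the function name, initialisation scheme, and hyperparameters are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def kmeans_candidates(illuminants, k, iters=50, seed=0):
    """Cluster measured illuminants (N x 3, normalised RGB) into k
    centroids via Lloyd's algorithm; the centroids then serve as the
    fixed candidate illuminant set."""
    illuminants = np.asarray(illuminants, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct training samples.
    centroids = illuminants[rng.choice(len(illuminants), k, replace=False)].copy()
    for _ in range(iters):
        # Assign each illuminant to its nearest centroid (Euclidean).
        d = np.linalg.norm(illuminants[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned samples.
        for j in range(k):
            members = illuminants[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids
```

In practice a library implementation with a k-means++ style initialisation would be preferable; the sketch only illustrates how the candidate set is derived from training illuminants.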
Appendix C NUS percamera median angular error
We provide evidence supporting our claim that training the proposed model with images from multiple cameras outperforms individual, per-camera model training (see the main paper).
We reiterate that folds are divided such that scene content is consistent within a fold, across all cameras. This avoids testing on scene content that was observed, via a different camera, during training. Towards reproducibility and fair comparison, our supplementary material provides the cross-validation (CV) splits used in the main paper for multi-device training. CV splits were generated manually by ensuring that all images of the same scene (across different cameras) belong to the same fold.
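The per-camera medians above aggregate the standard recovery angular error between estimated and ground-truth illuminant vectors, which can be computed as follows (a small numpy sketch; the function name is ours):

```python
import numpy as np

def angular_error_deg(est, gt):
    """Recovery angular error (degrees) between an estimated and a
    ground-truth illuminant RGB vector; invariant to vector scale."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    cos = np.dot(est, gt) / (np.linalg.norm(est) * np.linalg.norm(gt))
    # Clip guards against floating-point values marginally outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```

For example, an estimate proportional to the ground truth yields 0 degrees, while orthogonal vectors yield 90 degrees.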
Appendix D Candidate selection methods
We report additional illuminant candidate selection strategies explored during our investigation.
Uniform sampling: we consider the global extrema of our measured illuminant samples (max. and min. in each color space dimension) and sample points uniformly over a two-dimensional chromaticity color space. These samples constitute our illuminant candidates.
k-means clustering: cluster centroids define candidates, as detailed in the main paper and in other recent color constancy work [38]. We use RGB color space for clustering, and experimentally verified that chromaticity and RGB color spaces provide similar accuracy.
Gaussian Mixture Model (GMM): we fit a GMM to our measured illuminant samples in a two-dimensional chromaticity color space, and then draw samples from the GMM to define illuminant candidates.
For uniform candidate selection we use a regular grid of candidates; for GMM candidate selection, we fit two-dimensional Gaussian distributions and sample candidates from the mixture. In Table 10, we report inference performance on the Cube challenge [31] dataset using the described candidate selection strategies. We observe that simple uniform-sampling candidate selection performs reasonably well. The strategy provides an extremely simple implementation yet, by definition, will also sample some portion of very unlikely candidates. We note, however, that as long as the candidates span the illuminant space, our method can learn to interpolate between them appropriately, accounting for this. The GMM approach results in slightly weaker accuracy cf. k-means, motivating our choice of sampling strategy in the experimental work of the main paper.
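As a concrete illustration of the uniform strategy, the sketch below grids the extrema of the measured illuminants in an assumed (r, g) chromaticity parameterisation; the parameterisation and grid size are our assumptions for illustration, not the paper's exact choices:

```python
import numpy as np

def uniform_candidates(samples, n_per_axis):
    """Place an n x n grid of candidate illuminants spanning the extrema
    of measured illuminant samples (N x 3, RGB) in (r, g) chromaticity;
    the blue coordinate follows from r + g + b = 1."""
    samples = np.asarray(samples, dtype=float)
    chroma = samples[:, :2] / samples.sum(axis=1, keepdims=True)  # (r, g)
    lo, hi = chroma.min(axis=0), chroma.max(axis=0)
    r, g = np.meshgrid(np.linspace(lo[0], hi[0], n_per_axis),
                       np.linspace(lo[1], hi[1], n_per_axis))
    rg = np.stack([r.ravel(), g.ravel()], axis=1)
    b = 1.0 - rg.sum(axis=1)
    return np.column_stack([rg, b])
```

By construction some grid points fall in low-density (unlikely) regions of the chromaticity space, which matches the limitation noted above.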
Appendix E Inference runtime
We report inference runtime results for the Gehler-Shi dataset [47, 23] in Table 11. We note that our real-time inference speed is obtained using an Nvidia Tesla V100 GPU and an unoptimised implementation (PyTorch 1.0 [39]). We highlight that our algorithm is highly parallelisable, as each illuminant candidate likelihood can be computed independently; however, we obtain the reported runtime with a single-threaded implementation. Timing results are recorded at our standard input image resolution using k-means candidate selection. The timing performance of other methods is obtained from their respective citations. We acknowledge that timing comparisons are non-rigorous; reported runtimes are measured using differing hardware. To provide an additional fair comparison, Table 12 reports runtimes for both our method and the official FFCC [6] implementation (https://github.com/google/ffcc), run on Matlab R2019b under common hardware (Intel Core i9-9900X, 3.50 GHz).
Appendix F Failure cases
In Figures F.4, F.7 and F.10, we illustrate observed limitations and failure cases. Our method learns to interpolate between candidate illuminants observed during training, but not to extrapolate to novel illuminants. In subfigure (c) of Figure F.4, the ground-truth illuminant (green filled circle) is clearly out of distribution, with no similar candidate illuminants observed during training. Inference accuracy, shown in subfigure (a), suffers as a result.
Further, our single global illuminant assumption can be seen to be violated in Figure F.7. The predicted illuminant attempts to balance the outer boundary portions of the wall painting, clearly illuminated from above (out of shot), as achromatic. The measured ground-truth illuminant instead captures the desk lamp illumination, resulting in high angular error for this image due to the global assumption.
Finally, in Figure F.10, we observe an example scene with extreme ambiguity. Our method appears to infer that the stone building in the scene background is achromatic, producing a highly plausible image. Yet the measured ground-truth illuminant reveals the true building color to be a mild beige-yellow.
Appendix G Additional qualitative results
In Figure G.21, we provide additional qualitative results in the form of test images from the NUS [14] dataset (Sony camera). For each test sample we show the input image, the image white-balanced using the ground-truth illuminant, the output of our model ("multi-device training + pretraining"), and that of FFCC (model Q) [6]. Each row consists of: (a) the input image, (b) FFCC [6], (c) our prediction, (d) ground truth.
In similar fashion to [7], we sort test images by the combined mean angular error of the two evaluated methods and present images of increasing average difficulty, sampled with uniform spacing. Images are corrected by the inferred illuminants, followed by an estimated Color Correction Matrix (CCM) and standard sRGB gamma correction. The Macbeth ColorChecker used to generate the ground truth is present in the images; however, the relevant regions are masked during both training and inference. As can be observed in Figure G.21, in almost all sampled cases our approach yields consistently improved results.
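The per-pixel correction step described above (diagonal, von Kries-style division by the estimated illuminant, applied before the CCM and gamma steps, which are omitted here) can be sketched as follows; the normalisation by the maximum channel is an illustrative assumption:

```python
import numpy as np

def white_balance(raw, illum):
    """Correct a linear raw image (H x W x 3, values in [0, 1]) by the
    estimated global illuminant via per-channel division. The illuminant
    is scaled so its largest channel is 1, preserving overall exposure.
    CCM and sRGB gamma correction are intentionally omitted."""
    illum = np.asarray(illum, dtype=float)
    corrected = raw / (illum / illum.max())
    return np.clip(corrected, 0.0, 1.0)
```

For example, a perfectly achromatic scene captured under the estimated illuminant is mapped back to equal RGB channels.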
We provide further, extremely challenging examples in Figure G.42, for which we explicitly select the five images with the largest combined mean angular error. We observe that our method shows consistently strong performance, and highlight that these samples constitute cases of both ambiguous and multi-illuminant scenes, breaking the fundamental global illuminant assumption made by both methods.