Color constancy is an essential part of digital image processing pipelines. When treated as a computational process, this involves estimating the color of the scene light source present at capture time, and correcting the image such that its appearance matches that of the scene captured under an achromatic light source. The algorithmic process of recovering the illuminant of a scene is commonly known as computational Color Constancy (CC) or Automatic White Balance (AWB). Accurate estimation is essential for visual aesthetics, as well as for downstream high-level computer vision tasks [1, 4, 13, 17] that typically require color-unbiased and device-independent images.
Under the prevalent assumption that the scene is illuminated by a single or dominant light source, the observed pixels of an image are typically modelled using the physical model of Lambertian image formation captured under a trichromatic photosensor:

$$\rho_k(X) = \int_{\Omega} E(\lambda)\, S(\lambda, X)\, C_k(\lambda)\, d\lambda, \quad k \in \{R, G, B\}, \tag{1}$$

where $\rho_k(X)$ is the intensity of color channel $k$ at pixel location $X$, $\lambda$ is the wavelength of light such that $E(\lambda)$ represents the spectrum of the illuminant, $S(\lambda, X)$ is the surface reflectance at pixel location $X$ and $C_k(\lambda)$ is the camera sensitivity function for channel $k$, considered over the spectrum of wavelengths $\Omega$.

The goal of computational CC then becomes estimation of the global illumination color $\ell_k$ where:

$$\ell_k = \int_{\Omega} E(\lambda)\, C_k(\lambda)\, d\lambda, \quad k \in \{R, G, B\}. \tag{2}$$

Finding $\ell_k$ in Eq. 2 results in an ill-posed problem due to the existence of infinitely many combinations of illuminant and surface reflectance that result in identical observations at each pixel $X$.
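The ambiguity can be illustrated with the simplified per-channel product model (observed value = reflectance × illuminant). The following is a minimal sketch with made-up values, not measured data; the function name `observe` is an assumption introduced for illustration. It shows two different illuminant and reflectance pairs that yield exactly the same observed pixel:

```python
def observe(reflectance, illuminant):
    """Per-channel product model: observed RGB = reflectance * illuminant."""
    return tuple(r * l for r, l in zip(reflectance, illuminant))

# Hypothetical values, chosen to be exactly representable in floating point.
reddish_surface_white_light = observe((0.5, 0.25, 0.25), (1.0, 1.0, 1.0))
gray_surface_reddish_light = observe((0.5, 0.5, 0.5), (1.0, 0.5, 0.5))

# Both explanations produce the same observation, so the pixel alone
# cannot tell us which illuminant was present.
print(reddish_surface_white_light == gray_surface_reddish_light)
```

A single observation therefore supports several illuminant hypotheses, which is the motivation for reasoning about a set of candidates rather than a single point estimate.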
A natural and popular solution for learning-based color constancy is to frame the problem as a regression task [2, 28, 25, 9, 48, 34, 8]. However, typical regression methods provide a point estimate and do not offer any information regarding possible alternative solutions. Solution ambiguity is present in many vision domains [45, 36] and is particularly problematic in cases where multi-modal solutions exist. Specifically for color constancy we note that, due to the ill-posed nature of the problem, multiple illuminant solutions are often possible with varying probability. Data-driven approaches that learn to directly estimate the illuminant result in learning tasks that are inherently camera-specific due to the camera sensitivity function (c.f. Eq. 2). This will often manifest as a sensor domain gap: models trained on a single device typically exhibit poor generalisation to novel cameras.
In this work, we propose to address the ambiguous nature of the color constancy problem through multiple hypothesis estimation. Using a Bayesian formulation, we discretise the illuminant space and estimate the likelihood that each considered illuminant accurately corrects the observed image. We evaluate how plausible an image is after illuminant correction, and gather a discrete set of plausible solutions in the illuminant space. This strategy can be interpreted as framing color constancy as a classification problem, similar to recent promising work in this direction [7, 6, 38]. Discretisation strategies have also been successfully employed in other computer vision domains, such as 3D pose estimation and object detection [42, 43], resulting in, e.g., state-of-the-art accuracy improvements.
In more detail, we propose to decompose the AWB task into three sub-problems: a) selection of a set of candidate illuminants, b) learning to estimate the likelihood that an image, corrected by a candidate, is illuminated achromatically, and c) combining candidate illuminants, using the estimated posterior probability distribution, to produce a final output.
We correct an image with all candidates independently and evaluate the likelihood of each solution with a shallow CNN. Our network learns to estimate the likelihood of white balance correctness for a given image. In contrast to prior work, we disentangle camera-specific illuminant estimation from the learning task, thus allowing us to train a single, device-agnostic AWB model that can effectively leverage multi-device data. We avoid distribution shift and the resulting domain gap problems [2, 41, 22] associated with camera-specific training, and propose a well-founded strategy to leverage multiple datasets. Principled combination of datasets is of high value for learning-based color constancy given the typically small size of individual color constancy datasets (on the order of only hundreds of images). See Figure 1.
Our contributions can be summarised as:
We decompose the AWB problem into a novel multi-hypothesis three-stage pipeline.
We introduce a multi-camera learning strategy that allows us to leverage multi-device datasets and improve accuracy over single-camera training.
We provide a training-free model adaptation strategy for new cameras.
2 Related work
Classical color constancy methods utilise low-level statistics to realise various instances of the gray-world assumption: the average reflectance in a scene under a neutral light source is achromatic. Gray-World and its extensions [18, 50] are based on assumptions that tie scene reflectance statistics (e.g. mean, max reflectance) to the achromaticity of scene color.
Related assumptions define perfect reflectance [32, 20] and result in White-Patch methods. Statistical methods are fast and typically contain few free parameters; however, their performance is highly dependent on strong scene content assumptions, and these methods falter in cases where the assumptions fail to hold.
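As a point of reference, the two statistical families above can be sketched in a few lines. This is a minimal illustration, assuming an image given as a list of linear RGB tuples; the function names are ours and the estimators are the textbook per-channel mean (Gray-World) and per-channel maximum (White-Patch), not any specific published variant:

```python
import math

def normalise(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in vector))
    return tuple(c / norm for c in vector)

def gray_world(pixels):
    """Illuminant estimate from the per-channel mean of the image."""
    count = len(pixels)
    return normalise(tuple(sum(p[k] for p in pixels) / count for k in range(3)))

def white_patch(pixels):
    """Illuminant estimate from the per-channel maximum of the image."""
    return normalise(tuple(max(p[k] for p in pixels) for k in range(3)))

# Hypothetical gray scene lit by a reddish light with direction (2, 1, 1).
pixels = [(0.2, 0.1, 0.1), (0.4, 0.2, 0.2)]
print(gray_world(pixels), white_patch(pixels))
```

On this toy scene both estimators recover the same illuminant direction because the scene content satisfies their assumptions; a scene dominated by a single colored surface would mislead both.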
An early Bayesian framework used Bayes' rule to compute the posterior distribution for the illuminants and scene surfaces. The prior of the illuminant and the surface reflectance is modelled as a truncated multivariate normal distribution on the weights of a linear model. Other Bayesian works [44, 23] discretise the illuminant space and model the surface reflectance priors by learning real-world histogram frequencies; in one work the prior is modelled as a uniform distribution over a subset of illuminants, while another uses the empirical distribution of the training illuminants. Our work uses the Bayesian formulation proposed in previous works [44, 19, 23]. We estimate the likelihood probability distribution with a CNN which also explicitly learns to model the prior distribution for each illuminant.
Fully supervised methods. Early learning-based works [21, 53, 52] comprise combinational and direct approaches, typically relying on hand-crafted image features which limited their overall performance. Recent fully supervised convolutional color constancy work offers state-of-the-art estimation accuracy. Both local patch-based [8, 48, 9] and full image input [7, 34, 6, 25, 28] have been considered, investigating different model architectures [8, 9, 48] and the use of semantic information [28, 34, 6].
Some methods frame color constancy as a classification problem, e.g. CCC [7] and the follow-up refinement FFCC [6], by using a color space that identifies image re-illumination with a histogram shift. Thus, they elegantly and efficiently evaluate different illuminant candidates. Our method also discretises the illuminant space, but we explicitly select the candidate illuminants, allowing for multi-camera training, while FFCC [6] is constrained to use all histogram bins as candidates and to single-camera training.
A related method uses $K$-means to cluster the illuminants of the dataset and then applies a CNN to frame the problem as a classification task; the network input is a single (pre-white balanced) image and the output is a set of class probabilities, representing the prospect of each illuminant (each class) explaining the correct image illumination. Our method first chooses candidate illuminants similarly; however, the key difference is that our model learns to infer whether an image is well white balanced or not. We ask this question once per candidate by correcting the image, independently, with each illuminant candidate. This affords an independent estimation of the likelihood for each illuminant and thus enables multi-device training to improve results.
Multi-device training. The method of [2] introduces a two-CNN approach: the first network learns a 'sensor independent' linear transformation (matrix), the RGB image is transformed to this 'canonical' color space, and a second network then provides the predicted illuminant. The method is trained on multiple datasets, excluding the test camera, and obtains competitive results.
Another work affords fast adaptation to previously unseen cameras, and robustness to changes in capture device, by leveraging annotated samples across different cameras and datasets in a meta-learning framework.
A recent approach [10] assumes that sRGB images collected from the web are well white balanced; it therefore applies a simple de-gamma correction to approximate an inverse tone mapping and then finds achromatic pixels with a CNN to predict the illuminant. These web images were captured with unknown cameras, were processed by different ISP pipelines and might have been modified with image editing software. Despite these additional assumptions, the method achieves promising results that are, however, not comparable with the supervised state of the art.
In contrast, we propose an alternative technique to enable multi-camera training and mitigate well-understood sensor domain gaps. We can train a single CNN using images captured by different cameras through the use of camera-dependent illuminant candidates. This property, of accounting for camera-dependent illuminants, affords fast model adaptation; accurate inference is achievable for images captured by cameras not seen during training, if camera illuminant candidates are available (removing the need for model re-training or fine-tuning). We provide further methodological detail of these contributions and evidence towards their efficacy in Sections 3 and 4 respectively.
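3 Method
Let $y = (y_R, y_G, y_B)$ be a pixel from an input image $Y$ in linear RGB space. We model the global illumination, Eq. 2, with the standard linear model such that each pixel $y$ is the product of the surface reflectance $r = (r_R, r_G, r_B)$ and a global illuminant $\ell = (\ell_R, \ell_G, \ell_B)$ shared by all pixels such that:

$$y_k = r_k \cdot \ell_k, \quad k \in \{R, G, B\}. \tag{3}$$

Given $Y = (y_1, \dots, y_m)$, comprising $m$ pixels, and $R = (r_1, \dots, r_m)$, our goal is to estimate $\ell$ and produce $R = \mathrm{diag}(\ell)^{-1} Y$.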
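Under this linear model, correcting an image with a given illuminant amounts to a per-channel division of every pixel. A minimal sketch, assuming an image stored as a list of linear RGB tuples and a strictly positive illuminant (the function name `correct_image` is an illustrative assumption):

```python
def correct_image(pixels, illuminant):
    """Divide every pixel channel by the matching illuminant channel."""
    return [tuple(y / l for y, l in zip(pixel, illuminant)) for pixel in pixels]

# A gray surface observed under a reddish light becomes gray again
# once the image is corrected with the true illuminant.
observed = [(0.5, 0.25, 0.25)]
print(correct_image(observed, (1.0, 0.5, 0.5)))
```

Correcting with a wrong illuminant leaves a residual color cast, which is the cue the likelihood model is asked to detect.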
In order to estimate the correct illuminant to adjust the input image $Y$, we propose to frame the CC problem with a probabilistic generative model with unknown surface reflectances and illuminant. We consider a set of candidate illuminants $\ell_i$, $i \in \{1, \dots, n\}$, each of which is applied to $Y$ to generate a set of tentatively corrected images $\mathrm{diag}(\ell_i)^{-1} Y$. Using the set of corrected images as inputs, we then train a CNN to identify the most probable illuminants such that the final estimated illuminant is a linear combination of the candidates. In this section, we first introduce our general Bayesian framework, followed by our proposed implementation of the main building blocks of the model. An overview of the method can be seen in Figure 2.
3.1 Bayesian approach to color constancy
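Following the Bayesian formulation previously considered [44, 19, 23], we assume that the color of the light $\ell$ and the surface reflectance $R$ are independent. Formally, $P(\ell, R) = P(\ell)\, P(R)$, i.e. knowledge of the surface reflectance provides us with no additional information about the illuminant, $P(\ell \mid R) = P(\ell)$. Based on this assumption we decompose these factors and model them separately.
Using Bayes' rule, we define the posterior distribution of illuminants $\ell$ given the input image $Y$ as:

$$P(\ell \mid Y) = \frac{P(Y \mid \ell)\, P(\ell)}{P(Y)}. \tag{4}$$

We model the likelihood of an observed image $Y$ for a given illuminant $\ell$:

$$P(Y \mid \ell) = \int_{r} P(Y \mid \ell, R = r)\, P(R = r)\, dr = P\big(R = \mathrm{diag}(\ell)^{-1} Y\big), \tag{5}$$

where $R$ are the surface reflectances and $\mathrm{diag}(\ell)^{-1} Y$ is the image as corrected with illuminant $\ell$. The term $P(Y \mid \ell, R = r)$ is only non-zero for $r = \mathrm{diag}(\ell)^{-1} Y$. The likelihood rates whether a corrected image looks realistic.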
We choose to instantiate the model of our likelihood using a shallow CNN. The network should learn to output a high likelihood if the reflectances look realistic. We model the prior probability for each candidate illuminant independently as learnable parameters in an end-to-end approach; this effectively acts as a regularisation, favouring more likely real-world illuminants. We note that, in practice, the function modelling the prior also depends on factors such as the environment (indoor / outdoor), the time of day, ISO etc. However, the size of currently available datasets prevents us from modelling more complex proxies.
In order to estimate the illuminant $\ell^*$, we optimise the quadratic cost (minimum MSE Bayesian estimator), minimised by the mean of the posterior distribution:

$$\ell^* = \int_{\ell} \ell \cdot P(\ell \mid Y)\, d\ell. \tag{6}$$
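For completeness, the standard argument for why the posterior mean minimises the expected quadratic cost can be written out as follows. The notation is assumed here for illustration: $\hat{\ell}$ denotes an arbitrary estimate and $P(\ell \mid Y)$ the posterior over illuminants given the image $Y$.

```latex
\begin{aligned}
\frac{\partial}{\partial \hat{\ell}} \int_{\ell} \lVert \hat{\ell} - \ell \rVert^2 \, P(\ell \mid Y)\, d\ell
  &= 2 \int_{\ell} (\hat{\ell} - \ell)\, P(\ell \mid Y)\, d\ell \\
  &= 2 \Big( \hat{\ell} - \int_{\ell} \ell \, P(\ell \mid Y)\, d\ell \Big) = 0
  \;\;\Longrightarrow\;\;
  \hat{\ell} = \mathbb{E}[\ell \mid Y].
\end{aligned}
```

The second line uses $\int_{\ell} P(\ell \mid Y)\, d\ell = 1$; since the cost is convex in $\hat{\ell}$, the stationary point is the minimiser.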
This is done in the following three steps (c.f. Figure 2):
Candidate selection (Section 3.2): Choose a set of illuminant candidates to generate corrected thumbnail images.
Likelihood estimation (Section 3.3): Evaluate these images independently with a CNN, a network designed to estimate the likelihood that an image is well white balanced.
Illuminant determination (Section 3.4): Compute the posterior probability of each candidate illuminant and determine a final illuminant estimate.
This formulation allows estimation of a posterior probability distribution, allowing us to reason about a set of probable illuminants rather than produce a single illuminant point estimate (c.f. regression approaches). Regression typically does not provide feedback on a possible set of alternative solutions, which has been shown to be of high value in alternative vision problems.
The second benefit that our decomposition affords is a principled multi-camera training process. A single, device-agnostic CNN estimates illuminant likelihoods, while candidate illuminants are selected independently for each camera. By leveraging image information across multiple datasets we increase model robustness. Additionally, the amalgamation of small available CC datasets provides a step towards harnessing the power of large capacity models for this problem domain, c.f. contemporary models.
3.2 Candidate selection
The goal of candidate selection is to discretise the illuminant space of a specific camera in order to obtain a set of representative illuminants (spanning the illuminant space). Given a collection of ground truth illuminants, measured from images containing calibration objects (i.e. a labelled training set), we compute candidates using $K$-means clustering on the linear RGB space.
By forming $K$ clusters of our measured illuminants, we define the set of candidates as the cluster centers. $K$-means illuminant clustering has previously been shown to be effective for color constancy; however, we additionally evaluate alternative candidate selection strategies (detailed in the supplementary material). Our experimental investigation confirms that a simple $K$-means approach provides strong target task performance. Further, the effect of $K$ is empirically evaluated in Section 4.4.
An image captured by a given camera is then used to produce a set of images, corrected using the illuminant candidate set for that camera, on which we evaluate the accuracy of each candidate.
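Candidate selection can be sketched with a plain Lloyd's-algorithm implementation of k-means over measured illuminants. This is a simplified illustration, not the authors' implementation: the initial centres are simply the first `k` points, the iteration count is fixed, and the example illuminants are made up.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; returns k cluster centres used as candidates."""
    centres = [tuple(p) for p in points[:k]]  # simplistic initialisation
    dims = len(points[0])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            clusters[dists.index(min(dists))].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the previous centre if a cluster is empty
                centres[i] = tuple(
                    sum(p[d] for p in members) / len(members) for d in range(dims)
                )
    return centres

# Hypothetical ground-truth illuminants for one camera.
measured = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.9, 0.1, 0.0), (0.1, 0.9, 0.0)]
candidates = kmeans(measured, k=2)
print(candidates)
```

Because the candidates are computed per camera from that camera's own labelled illuminants, this step is the only device-dependent part of the pipeline.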
3.3 Likelihood estimation
We model the likelihood estimation step using a neural network which, for a given illuminant $\ell$ and image $Y$, takes the tentatively corrected image $\mathrm{diag}(\ell)^{-1} Y$ as input, and learns to predict the likelihood that the image has been well white balanced, i.e. has an appearance of being captured under an achromatic light source.
The success of low capacity histogram based methods [7, 6] and the inference-training tradeoff for small datasets motivate a compact network design. We propose a small CNN with one spatial convolution and subsequent layers constituting $1 \times 1$ convolutions with spatial pooling. Lastly, three fully connected layers gradually reduce the dimensionality to one (see supplementary material for architecture details). Our network output is then a single value that represents the log-likelihood that the image is well white balanced:

$$\log P(Y \mid \ell) = f^{W}\big(\mathrm{diag}(\ell)^{-1} Y\big). \tag{7}$$

Function $f^{W}$ is our trained CNN parametrised by model weights $W$. Eq. 7 estimates the log-likelihood of each candidate illuminant separately. It is important to note that we only train a single CNN, which is used to estimate the likelihood for each candidate illuminant independently. However, in practice, certain candidate illuminants will be more common than others. To account for this, following prior work, we compute an affine transformation of our log-likelihood by introducing learnable, illuminant-specific gain $G_i$ and bias $B_i$ parameters. Gain affords amplification of illuminant likelihoods. The bias term learns to prefer some illuminants, i.e. a prior distribution in a Bayesian sense: $B_i = \log P(\ell_i)$. The log-posterior probability can then be formulated as:

$$\log P(\ell_i \mid Y) = G_i \cdot \log P(Y \mid \ell_i) + B_i. \tag{8}$$
We highlight that learned affine transformation parameters are training camera-dependent and provide further discussion on camera agnostic considerations in Section 3.5.
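The role of the bias as a log-prior can be checked numerically. The sketch below is illustrative only: the likelihood and prior values are made up, the gains are set to one, and the function names are assumptions. With unit gain and a bias equal to the log-prior, normalising the affine scores reproduces the Bayes posterior.

```python
import math

def log_posterior(log_likelihoods, gains, biases):
    """Affine transform of per-candidate log-likelihoods (unnormalised)."""
    return [g * ll + b for ll, g, b in zip(log_likelihoods, gains, biases)]

def softmax(scores):
    """Numerically stable normalisation of log-scores to probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical two-candidate example.
likelihoods = [0.2, 0.6]
priors = [0.75, 0.25]
scores = log_posterior(
    [math.log(p) for p in likelihoods],
    gains=[1.0, 1.0],
    biases=[math.log(p) for p in priors],
)
posterior = softmax(scores)
print(posterior)
```

Here the less likely candidate has the stronger prior, so the two products (0.2 × 0.75 and 0.6 × 0.25) coincide and the posterior is split evenly.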
3.4 Illuminant determination
We require a differentiable method in order to train our model end-to-end, and therefore a simple Maximum a Posteriori (MAP) inference strategy is not possible. To estimate the illuminant $\ell^*$, we instead use the minimum mean square error Bayesian estimator, which is minimised by the posterior mean of $\ell$ (c.f. Eq. 6):

$$\ell^* = \sum_{i=1}^{n} \ell_i \cdot \mathrm{softmax}\big(\log P(\ell_i \mid Y)\big), \tag{9}$$

where the softmax is taken over the $n$ candidates. The resulting vector $\ell^*$ is $L_2$-normalised. We leverage our $K$-means centroid representation of the linear RGB space and use linear interpolation within the convex hull of feasible illuminants to determine the estimated scene illuminant. For Eq. 9, we take inspiration from [29, 38], who have successfully explored similar strategies in CC and stereo regression, e.g. introducing an analogous soft-argmin to estimate disparity values from a set of candidates. We apply a similar strategy for illuminant estimation and use the soft-argmax, which provides a linear combination of all candidates weighted by their probabilities.
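The soft-argmax can be sketched as a probability-weighted average of the candidates followed by normalisation. This is a minimal illustration under assumed inputs (candidate RGB tuples and one log-posterior score per candidate); the function name is ours.

```python
import math

def soft_argmax(candidates, log_posteriors):
    """Posterior-weighted mean of candidate illuminants, L2-normalised."""
    top = max(log_posteriors)
    weights = [math.exp(s - top) for s in log_posteriors]
    total = sum(weights)
    mean = [
        sum(w * c[k] for w, c in zip(weights, candidates)) / total
        for k in range(3)
    ]
    norm = math.sqrt(sum(m * m for m in mean))
    return tuple(m / norm for m in mean)

candidates = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# Equal scores interpolate between the two candidates ...
print(soft_argmax(candidates, [0.0, 0.0]))
# ... while a strongly peaked posterior approaches the MAP candidate.
print(soft_argmax(candidates, [0.0, -50.0]))
```

Unlike a hard argmax, every operation here is differentiable with respect to the scores, and the estimate can fall between candidates.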
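We train our network end-to-end with the commonly used angular error loss function, where $\ell^*$ and $\ell$ are the prediction and ground truth illuminant, respectively:

$$\mathcal{L}_{\text{error}} = \arccos\left(\frac{\ell \cdot \ell^*}{\lVert \ell \rVert\, \lVert \ell^* \rVert}\right). \tag{10}$$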
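The angular error is straightforward to compute. A minimal sketch in plain Python, reporting degrees (the unit commonly used when reporting this metric); the clamp on the cosine is a numerical safeguard we add, not part of the definition:

```python
import math

def angular_error_degrees(prediction, ground_truth):
    """Angle between two illuminant vectors, in degrees."""
    dot = sum(p * g for p, g in zip(prediction, ground_truth))
    norm_p = math.sqrt(sum(p * p for p in prediction))
    norm_g = math.sqrt(sum(g * g for g in ground_truth))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_g)))
    return math.degrees(math.acos(cosine))

print(angular_error_degrees((1.0, 0.0, 0.0), (1.0, 1.0, 0.0)))
```

The measure depends only on the direction of the two vectors, so the overall brightness of the illuminant does not affect it.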
3.5 Multi-device training
As discussed in previous work [2, 41, 22], CC models typically fail to train successfully using data from multiple cameras due to distribution shifts between camera sensors, making them intrinsically device-dependent and limiting model capacity. A device-independent model is highly appealing due to the small number of images commonly available in camera-specific public color constancy datasets. Collecting and labelling large new datasets for specific novel devices is expensive and time-consuming, often prohibitively so.
Our CNN learns to produce the likelihood that an input image is well white balanced. We claim that framing part of the CC problem in this fashion results in a device-independent learning task. We evaluate the benefit of this hypothesis experimentally in Section 4.
To train with multiple cameras we use camera-specific candidates, yet learn only a single model. Specifically, we train with a different camera for each batch and use camera-specific candidates, yet update a single set of CNN parameters during model training. In order to ensure that our CNN is device-independent, we fix the previously learnable parameters that depend on sensor-specific illuminants, i.e. the gain and bias parameters of Section 3.3. The absence of these parameters, learned in a camera-dependent fashion, intuitively restricts model flexibility; however, we observe this drawback to be compensated by the resulting ability to train using amalgamated multi-camera datasets, i.e. more data. This strategy allows our CNN to be camera-agnostic and affords the option to refine existing CNN quality when data from novel cameras becomes available. We clarify, however, that our overarching strategy for white balancing maintains the use of camera-specific candidate illuminants.
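One way to realise a "different camera for each batch" schedule is sketched below. The function name `camera_batches`, the dictionary layout and the fixed seed are illustrative assumptions rather than the authors' implementation; the point is only that each batch draws from a single camera, so the matching camera-specific candidate set can be applied to the whole batch.

```python
import random

def camera_batches(images_by_camera, batch_size, seed=0):
    """Return shuffled (camera, batch) pairs; each batch is single-camera."""
    rng = random.Random(seed)
    batches = []
    for camera, images in images_by_camera.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batches.append((camera, shuffled[start:start + batch_size]))
    rng.shuffle(batches)  # interleave cameras across training steps
    return batches

# Hypothetical file names for two cameras.
data = {
    "cam_a": ["a0", "a1", "a2", "a3", "a4"],
    "cam_b": ["b0", "b1", "b2"],
}
for camera, batch in camera_batches(data, batch_size=2):
    # A training step would look up the candidates for `camera` here
    # and update the single shared set of network weights.
    print(camera, batch)
```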
4.1 Training details
Dropout of 50% is applied after average pooling. We take the log transform of the input before the first convolution. Efficient inference is feasible by concatenating each candidate-corrected image into the batch dimension. We use PyTorch 1.0 and an Nvidia Tesla V100 for our experiments. The first layer is the only spatial convolution; it is adapted from prior work and pretrained on ImageNet [16]. We fix the weights of this first layer to avoid over-fitting. For all experiments calibration objects are masked, the black level is subtracted and over-saturated pixels are clipped at a fixed threshold. We resize the image and normalise it.
4.2 Datasets
We experiment using three public datasets. The Gehler-Shi dataset [47, 23] contains images of indoor and outdoor scenes. Images were captured using Canon 1D and Canon 5D cameras. We highlight our awareness of the existence of multiple sets of non-identical ground-truth labels for this dataset. Our Gehler-Shi evaluation is conducted using the SFU ground-truth labels. The NUS dataset [14] originally consists of eight subsets of 210 images per camera. The Cube+ dataset [5] contains images captured with a Canon 550D camera, consisting of predominantly outdoor imagery.
For the NUS [14] and Gehler-Shi [47, 23] datasets we perform three-fold cross validation (CV) using the splits provided in previous work [6, 7]. The Cube+ [5] dataset does not provide splits for CV, so we use all images for learning and evaluate using a related set of test images, provided for the recent Cube+ ISPA 2019 challenge. We compare with the results from the challenge leader-board.
For the NUS dataset [14], we additionally explore training multi-camera models and thus create a new set of CV folds to facilitate this. We are careful to highlight that the NUS dataset consists of eight image subsets, pertaining to eight capture devices. Each of our new folds captures a distinct set of scene content (i.e. sets of up to eight similar images for each captured scene). This avoids testing on similar scene content seen during training. We define our multi-camera CV such that each multi-camera fold is the concatenation of images, pertaining to common scenes, captured from all eight cameras. The folds that we define are made available in our supplementary material.
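The scene-separated folds described above can be sketched as a grouping by scene identifier. This is an illustrative assumption about how such folds could be built, not the released fold definition: the `(scene_id, camera, image_name)` layout, the round-robin assignment and the default of three folds are ours.

```python
def scene_folds(samples, num_folds=3):
    """Assign whole scenes to folds so no scene is split across folds.

    samples: list of (scene_id, camera, image_name) tuples.
    """
    scenes = sorted({scene for scene, _, _ in samples})
    fold_of_scene = {scene: i % num_folds for i, scene in enumerate(scenes)}
    folds = [[] for _ in range(num_folds)]
    for sample in samples:
        folds[fold_of_scene[sample[0]]].append(sample)
    return folds

# Hypothetical data: six scenes, each captured by two cameras.
samples = [
    (scene, cam, f"{cam}_{scene}.png")
    for scene in range(6)
    for cam in ("cam_a", "cam_b")
]
for index, fold in enumerate(scene_folds(samples)):
    print(index, sorted({scene for scene, _, _ in fold}))
```

Each fold then contains images from every camera, while a scene photographed by several cameras never appears in both training and test data.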
4.3 Evaluation metrics
We use the standard angular error metric for quantitative evaluation (c.f. Eq. 10). We report standard CC statistics to summarise results over the investigated datasets: Mean, Median, Trimean, Best 25%, Worst 25%. We further report method inference time in the supplementary material. Other works' results were taken from the corresponding papers, resulting in missing statistics for some methods. As the NUS dataset is composed of eight cameras, we report the geometric mean of each statistic for each method across all cameras, as is standard in the literature [6, 7, 28].
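The summary statistics can be computed with the Python standard library. The sketch below is illustrative: the quartile convention (`method="inclusive"`), the handling of the tail size and the 25% tail fraction are assumptions, since conventions for these details vary between evaluation scripts.

```python
import statistics

def summarise(errors, tail_fraction=0.25):
    """Summary statistics of per-image angular errors (degrees)."""
    ordered = sorted(errors)
    tail = max(1, int(len(ordered) * tail_fraction))
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "trimean": (q1 + 2 * q2 + q3) / 4,
        "best": statistics.mean(ordered[:tail]),    # lowest errors
        "worst": statistics.mean(ordered[-tail:]),  # highest errors
    }

def across_cameras(per_camera_summaries):
    """Geometric mean of each statistic over all cameras."""
    keys = per_camera_summaries[0].keys()
    return {
        key: statistics.geometric_mean([s[key] for s in per_camera_summaries])
        for key in keys
    }

# Hypothetical per-image errors for a single camera.
print(summarise([9, 1, 8, 2, 7, 3, 6, 4, 5]))
```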
4.4 Quantitative evaluation
Accuracy experiments. We report competitive results on the dataset of Gehler-Shi [47, 23] (c.f. Table 1). This dataset can be considered very challenging as the number of images per camera is imbalanced between the Canon 1D and Canon 5D subsets. Our method is not able to outperform the state of the art, likely due to the imbalanced nature and small size of the Canon 1D subset. Pretraining on a combination of NUS [14] and Cube+ [5] provides moderate accuracy improvement despite the fact that the Gehler-Shi dataset has a significantly different illuminant distribution compared to those seen during pre-training. We provide additional experiments, exploring the effect of varying $K$ for $K$-means candidate selection, in the supplementary material.
Results for NUS [14] are provided in Table 2. Our method obtains competitive accuracy and the previously observed trend again holds: pre-training using additional datasets (here Gehler-Shi [47, 23] and Cube+ [5]) improves results.
In Table 3, we report results for our multi-device setting on the NUS [14] dataset. For this experiment we introduce a new set of training folds to ensure that scenes are well separated, and refer to Section 3.5 for multi-device training and Section 4.2 for details of the related training folds. We draw a multi-device comparison with FFCC [6] by choosing to center the FFCC histogram with the training set (of amalgamated camera datasets). Note that results are not directly comparable with Table 2 due to our redefinition of CV folds. Our method is more accurate than the state of the art when training considers all available cameras at the same time. Note that multi-device training improves the median angular error of each individual camera dataset (we provide results in the supplementary material). Overall performance is also improved in terms of median accuracy.
In summary, we observe strong generalisation when using multiple camera training (e.g. NUS [14] results, c.f. Tables 3 and 2).
These experiments illustrate the large benefit achievable with multi-camera training when illuminant distributions of the cameras are broadly consistent.
Gehler-Shi [47, 23] has a very disparate illuminant distribution with respect to alternative datasets and we are likely unable to exploit the full advantage of multi-camera training. We note that the state-of-the-art FFCC [6] method is extremely shallow and therefore optimised for small datasets. In contrast, when our model is trained on large and relevant datasets we are able to achieve superior results.
Run time. We measure inference speed in milliseconds, using an unoptimised PyTorch implementation (see supplementary material for further detail).
4.5 Training on novel sensors
To explore camera-agnostic elements of our model, we train on a combination of the full NUS [14] and Gehler-Shi [47, 23] datasets. As described in Section 3.5, the only remaining device-dependent component involves performing illuminant candidate selection per device. Once the model is trained, we select candidates from Cube+ [5] and test on the Cube challenge dataset. We highlight that neither Cube+ nor Cube challenge imagery is seen during model training. For meaningful evaluation, we compare against both classical and recent learning-based [2] camera-agnostic methods. Results are shown in Table 5. We obtain results that are comparable to Table 4 without seeing any imagery from our target camera, outperforming both baselines and [2]. We clarify that our method performs candidate selection using Cube+ [5] to adapt the candidate set to the novel device, while [2] does not see any information from the new camera.
We provide additional experimental results for differing values of $K$ ($K$-means candidate selection) in the supplementary material. We observe stability even with a low number of candidates, which is likely linked to the two Cube datasets having reasonably compact distributions.
4.6 Qualitative evaluation
We provide visual results for the Gehler-Shi [47, 23] dataset in Figure 21. We sort inference results by increasing angular error and sample images uniformly. For each row, we show (a) the input image, (b) our estimated illuminant color and resulting white-balanced image, and (c) the ground truth illuminant color and resulting white-balanced image. Images are first white-balanced; we then apply an estimated CCM (Color Correction Matrix) and, finally, sRGB gamma correction. We mask out the Macbeth Color Checker calibration object during both training and evaluation.
Our most challenging example (c.f. last row of Figure 21) is a multi-illuminant scene (indoor and outdoor lights). We observe that our method performs accurate correction for objects illuminated by the outdoor light, yet the ground truth is only measured for the indoor illuminant, hence the high angular error. This highlights the limitation linked to our single global illuminant assumption, common to the majority of CC algorithms. We show additional qualitative results in the supplementary material.
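The rendering steps used for visualisation (white balance, CCM, sRGB gamma) can be sketched per pixel as follows. This is a minimal illustration: the identity CCM and the pixel values are placeholders, the clipping to [0, 1] is our assumption, and only the sRGB transfer function follows the standard definition.

```python
def srgb_gamma(value):
    """Standard sRGB transfer function for a linear value, clipped to [0, 1]."""
    value = min(1.0, max(0.0, value))
    if value <= 0.0031308:
        return 12.92 * value
    return 1.055 * value ** (1.0 / 2.4) - 0.055

def render(pixel, illuminant, ccm):
    """White-balance a linear RGB pixel, apply a 3x3 CCM, then sRGB gamma."""
    balanced = [y / l for y, l in zip(pixel, illuminant)]
    converted = [
        sum(ccm[row][col] * balanced[col] for col in range(3))
        for row in range(3)
    ]
    return tuple(srgb_gamma(c) for c in converted)

# Placeholder CCM; a real one is camera-specific.
identity_ccm = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(render((0.5, 0.25, 0.25), (1.0, 0.5, 0.5), identity_ccm))
```

With the correct illuminant the three output channels are equal, i.e. the gray surface is rendered as neutral gray.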
5 Conclusion
We propose a novel multi-hypothesis color constancy model capable of effectively learning from image samples that were captured by multiple cameras. We frame the problem under a Bayesian formulation and obtain data-driven likelihood estimates by learning to classify achromatic imagery. We highlight the challenging nature of multi-device learning due to camera color space differences, spectral sensitivity and physical sensor effects. We validate the benefits of our proposed solution for multi-device learning and provide state-of-the-art results on two popular color constancy datasets while maintaining real-time inference constraints. We additionally provide evidence supporting our claim that framing the learning question as a classification task (c.f. regression) can lead to strong performance without requiring model re-training or fine-tuning.
What else can fool deep learning? addressing color constancy errors on deep neural network performance. In 2019 IEEE International Conference on Computer Vision, ICCV 2019, Seoul, Korea, October 29-November 1, 2019, Cited by: §1.
-  (2019) Sensor-Independent Illumination Estimation for DNN Models. In Proceedings of the British Machine Vision Conference 2019, BMVC 2019, Cardiff University, Cardiff, UK, September 9-12, 2019, Cited by: §1, §1, §2, §3.5, §4.5, Table 1, Table 2, Table 4, Table 5.
When color constancy goes wrong: correcting improperly white-balanced images.
IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 1535–1544. External Links: Cited by: Table 4.
-  (2012) On sensor bias in experimental methods for comparing interest-point, saliency, and recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1), pp. 110–126. External Links: Cited by: §1.
-  (2018) Unsupervised learning for color constancy. In Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2018) - Volume 4: VISAPP, Funchal, Madeira, Portugal, January 27-29, 2018, pp. 181–188. External Links: Cited by: Table 8, item 4, §4.2, §4.2, §4.4, §4.4, §4.5, Table 5.
-  (2017) Fast fourier color constancy. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 6950–6958. External Links: Cited by: Table 11, Table 12, Appendix E, Figure G.21, Figure G.42, Appendix G, A Multi-Hypothesis Approach to Color Constancy: supplementary material, Figure 1, §1, §2, §2, §3.3, §3.3, §4.2, §4.3, §4.4, §4.4, Table 1, Table 2, Table 3, Table 4.
-  (2015) Convolutional color constancy. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 379–387. External Links: Cited by: Table 11, Appendix G, §1, §2, §2, §3.3, §4.2, §4.3, Table 1, Table 2.
-  (2015) Color constancy using cnns. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2015, Boston, MA, USA, June 7-12, 2015, pp. 81–89. External Links: Cited by: §1, §2.
Single and multiple illuminant estimation using convolutional neural networks. IEEE Transactions on Image Processing 26 (9), pp. 4347–4362. External Links: Cited by: §1, §2.
-  (2019) Quasi-unsupervised color constancy. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 12212–12221. External Links: Cited by: §2, Table 1, Table 2.
-  (1986) Analysis of the retinex theory of color vision. JOSA A 3 (10), pp. 1651–1661. Cited by: Table 1, Table 2.
-  (1980) A spatial processor model for object colour perception. Journal of the Franklin institute 310 (1), pp. 1–26. Cited by: §2, Table 1, Table 2, Table 4, Table 5.
-  (2018) Modeling camera effects to improve deep vision for real and synthetic data. CoRR abs/1803.07721. External Links: Cited by: §1.
-  (2014) Illuminant estimation for color constancy: why spatial-domain methods work and the role of the color distribution. JOSA A 31 (5), pp. 1049–1058. Cited by: Table 8, Table 9, Appendix C, Figure G.21, Figure G.42, Appendix G, A Multi-Hypothesis Approach to Color Constancy: supplementary material, Figure 1, item 4, §4.2, §4.2, §4.2, §4.3, §4.4, §4.4, §4.4, §4.4, §4.4, §4.5, Table 2, Table 3, Table 5.
-  (2015) Effective learning-based illuminant estimation using simple features. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 1000–1008. External Links: Cited by: Table 11, Table 1, Table 2.
-  (2009) ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pp. 248–255. External Links: Cited by: §4.1.
-  (2017) Dirty pixels: optimizing image classification architectures for raw sensor data. CoRR abs/1701.06487. External Links: Cited by: §1.
-  (2004) Shades of gray and colour constancy. In The Twelfth Color Imaging Conference: Color Science and Engineering Systems, Technologies, Applications, CIC 2004, Scottsdale, Arizona, USA, November 9-12, 2004, pp. 37–41. External Links: Cited by: §2.
-  (1995) Bayesian decision theory, the maximum local mass estimate, and color constancy. In Proceedings of the Fifth International Conference on Computer Vision (ICCV 95), Massachusetts Institute of Technology, Cambridge, Massachusetts, USA, June 20-23, 1995, pp. 210–217. External Links: Cited by: §2, §3.1.
-  (2010) The rehabilitation of maxrgb. In 18th Color and Imaging Conference, CIC 2010, San Antonio, Texas, USA, November 8-12, 2010, pp. 256–259. External Links: Cited by: §2.
-  (2004) Estimating illumination chromaticity via support vector regression. In The Twelfth Color Imaging Conference: Color Science and Engineering Systems, Technologies, Applications, CIC 2004, Scottsdale, Arizona, USA, November 9-12, 2004, pp. 47–52. External Links: Cited by: §2.
-  (2017) Improving color constancy by discounting the variation of camera spectral sensitivity. JOSA A 34 (8), pp. 1448–1462. Cited by: §1, §3.5.
-  (2008) Bayesian color constancy revisited. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 24-26 June 2008, Anchorage, Alaska, USA, External Links: Cited by: Table 7, Table 8, Appendix B, Table 11, Table 12, Appendix E, item 4, §2, §3.1, Figure 21, §4.2, §4.2, §4.4, §4.4, §4.4, §4.4, §4.5, §4.6, Table 1, Table 2, Table 5.
-  (2009) Perceptual analysis of distance measures for color constancy algorithms. JOSA A 26 (10), pp. 2243–2256. Cited by: §1.
-  (2019) Convolutional mean: a simple convolutional neural network for illuminant estimation. In Proceedings of the British Machine Vision Conference 2019, BMVC 2019, Cardiff University, Cardiff, UK, September 9-12, 2019, Cited by: Table 11, §1, §2, Table 1, Table 2.
-  (2018) Rehabilitating the colorchecker dataset for illuminant estimation. In Color and Imaging Conference, Vol. 2018, pp. 350–353. Cited by: §4.2.
-  (2012) Improving neural networks by preventing co-adaptation of feature detectors. CoRR abs/1207.0580. External Links: Cited by: §4.1.
-  (2017) FC^4: fully convolutional color constancy with confidence-weighted pooling. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 330–339. External Links: Cited by: Table 11, §1, §2, §4.3, Table 1, Table 2.
-  (2017) End-to-end learning of geometry and context for deep stereo regression. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 66–75. External Links: Cited by: §3.4.
-  (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Cited by: §4.1.
-  ISPA 2019 Illumination Estimation Challenge. Note: https://www.isispa.org/illumination-estimation-challenge (accessed November 14, 2019). Cited by: Table 8, Table 10, Appendix D, §4.2, §4.4, §4.5, Table 4, Table 5.
-  (1971) Lightness and retinex theory. JOSA 61 (1), pp. 1–11. Cited by: §2.
-  (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28 (2), pp. 129–136. External Links: Cited by: Figure 2, §2, §3.2, §4.1.
-  (2015) Color constancy by deep learning. In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, pp. 76.1–76.12. External Links: Cited by: §1, §2.
-  (2018) A mixed classification-regression framework for 3d pose estimation from 2d images. In British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018, pp. 72. External Links: Cited by: §1, §1, §3.1.
-  (2019) Explaining the ambiguity of object detection and 6d pose from visual data. 2019 IEEE International Conference on Computer Vision, ICCV 2019, Seoul, Korea, October 29-November 1, 2019. Cited by: §1.
-  (2018) Meta-learning for few-shot camera-adaptive color constancy. CoRR abs/1811.11788. External Links: Cited by: §2, Table 1, Table 2.
-  (2017) Approaching the computational color constancy as a classification problem through deep learning. Pattern Recognition 61, pp. 405–416. External Links: Cited by: Appendix D, §1, §2, §3.2, §3.4, Table 1, Table 2.
-  (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 8024–8035. External Links: Cited by: Appendix E, §4.1.
-  (2019) Fast fourier color constancy and grayness index for ISPA illumination estimation challenge. In 11th International Symposium on Image and Signal Processing and Analysis, ISPA 2019, Dubrovnik, Croatia, September 23-25, 2019, pp. 352–354. External Links: Cited by: Table 4.
-  (2014) Raw-to-raw: mapping between image sensor color responses. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pp. 3398–3405. External Links: Cited by: §1, §3.5.
-  (2016) You only look once: unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 779–788. External Links: Cited by: §1.
-  (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. External Links: Cited by: §1.
-  (2003) Bayesian color constancy with non-gaussian models. In Advances in Neural Information Processing Systems 16 [Neural Information Processing Systems, NIPS 2003, December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada], pp. 1595–1602. External Links: Cited by: §2, §3.1.
-  (2017) Learning in an uncertain world: representing ambiguity through multiple hypotheses. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 3611–3620. External Links: Cited by: §1.
-  (2019) Color cerberus. In 11th International Symposium on Image and Signal Processing and Analysis, ISPA 2019, Dubrovnik, Croatia, September 23-25, 2019, pp. 355–359. External Links: Cited by: Table 4.
-  Re-processed version of the Gehler color constancy dataset. Note: https://www2.cs.sfu.ca/~colour/data/shi_gehler/ (accessed November 14, 2019). Cited by: Table 7, Table 8, Appendix B, Table 11, Table 12, Appendix E, item 4, Figure 21, §4.2, §4.2, §4.4, §4.4, §4.4, §4.4, §4.5, §4.6, Table 1, Table 5.
-  (2016) Deep specialized network for illuminant estimation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pp. 371–387. External Links: Cited by: §1, §2, Table 1, Table 2.
-  (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Cited by: §4.1.
-  (2007) Edge-based color constancy. IEEE Transactions on Image Processing 16 (9), pp. 2207–2214. External Links: Cited by: §2, Table 4, Table 5.
-  (1905) Influence of adaptation on the effects produced by luminous stimuli. Handbuch der Physiologie des Menschen 3, pp. 109–282. Cited by: §3.
-  (2009) Edge-based color constancy via support vector regression. IEICE Transactions on Information and Systems 92-D (11), pp. 2279–2282. External Links: Cited by: §2.
-  (2006) Estimating illumination chromaticity via support vector regression. Journal of Imaging Science and Technology 50 (4), pp. 341–348. Cited by: §2.
A Multi-Hypothesis Approach to Color Constancy: supplementary material
We provide additional material to supplement our main paper. In Appendix A, we present our shallow CNN architecture. Two experimental studies on the number of illuminant candidates are provided in Appendix B. In Appendix C, we report NUS per-camera median angular errors as evidence for our claim that multi-camera training consistently improves accuracy for each camera (see the main paper). In Appendix D, we show additional results from our exploration of candidate selection strategies. Appendix E provides run-time measurements, and in Appendix F we present failure cases and discuss limitations of our method. Finally, Appendix G provides additional visual results comparing our method with FFCC.
Appendix A Architecture details
In Table 6, we present our CNN architecture. We propose a shallow CNN: one spatial convolution, followed by two subsequent layers of convolutions and a final global spatial pooling. Lastly, three fully connected layers gradually reduce the dimensionality to one.
Table 6: CNN architecture details. Fully connected layers and convolutions are followed by a ReLU activation, except for the last layer.
Appendix B Number of illuminant candidates
In Table 7 we present a study varying the number of candidate illuminants produced by K-means. We find experimentally that accuracy improves with the number of cluster centres until a plateau is reached, suggesting that we need candidate illuminants to achieve competitive angular error for the Gehler-Shi dataset [47, 23].
Additionally, in Table 8 we provide analogous results for different values of K for K-means candidate selection with the training-free model (see main paper Section 4.5). We observe stability for . The low number of candidates required is likely linked to the two Cube datasets having reasonably compact illuminant distributions.
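The candidate-generation step studied in this appendix can be sketched as follows. This is a minimal illustration of Lloyd's algorithm in plain Python, not the implementation used in our experiments; the function name `kmeans`, the seed, the iteration cap and the toy illuminant values are all assumptions made for the example.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm on a list of equal-length tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each illuminant to its nearest centroid (squared Euclidean).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids

# Toy RGB illuminants (hypothetical values): one warm group and one cool group.
illuminants = [
    (0.80, 0.50, 0.30), (0.82, 0.50, 0.28), (0.78, 0.50, 0.32),
    (0.30, 0.50, 0.80), (0.28, 0.50, 0.82), (0.32, 0.50, 0.78),
]
candidates = kmeans(illuminants, k=2)
```

Sweeping `k` over a range and re-evaluating angular error for each candidate set would reproduce the kind of study reported in Tables 7 and 8.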
Appendix C NUS per-camera median angular error
We provide evidence supporting the claim in our main paper that training the proposed model with images from multiple cameras outperforms individual, per-camera model training.
We reiterate that folds are divided such that scene content is consistent within a fold, across all cameras. This avoids testing on familiar scene content that was observed by a different camera during training. For reproducibility and fair comparison, our supplementary material provides the cross validation (CV) splits used in the main paper for multi-device training. CV splits were generated manually by ensuring that all images of the same scene (across different cameras) belong to the same fold.
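The constraint on the splits can be illustrated with a short sketch. The splits in this work were generated manually, so the round-robin assignment below is only an assumed stand-in that satisfies the same property (all images of a scene share a fold); the function name, the tuple layout and the sample identifiers are hypothetical.

```python
def scene_consistent_folds(samples, n_folds):
    """samples: list of (image_id, camera, scene_id) tuples.

    Returns a dict mapping image_id -> fold index such that every image of a
    given scene, regardless of camera, lands in the same fold.
    """
    scenes = sorted({scene for _, _, scene in samples})
    # Assumption: round-robin over sorted scene ids (not the manual procedure).
    scene_to_fold = {scene: i % n_folds for i, scene in enumerate(scenes)}
    return {image_id: scene_to_fold[scene] for image_id, _, scene in samples}

# Hypothetical identifiers: the same three scenes captured by two cameras.
samples = [
    ("canon_0001", "Canon", "scene_a"),
    ("sony_0001", "Sony", "scene_a"),
    ("canon_0002", "Canon", "scene_b"),
    ("sony_0002", "Sony", "scene_b"),
    ("canon_0003", "Canon", "scene_c"),
    ("sony_0003", "Sony", "scene_c"),
]
folds = scene_consistent_folds(samples, n_folds=3)
```

Splitting by scene rather than by image is what prevents a test image from showing content already seen through another camera during training.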
Appendix D Candidate selection methods
We report additional illuminant candidate selection strategies explored during our investigation.
Uniform-sampling: we consider the global extrema of our measured illuminant samples (max. and min. in each color space dimension) and sample points uniformly using an [, ] color space. These samples constitute our illuminant candidates.
K-means clustering: cluster centroids define candidates, as detailed in the main paper and in other recent color constancy work. We use RGB color space for clustering, and experimentally verified that both [, ] and RGB color spaces provided similar accuracy.
Gaussian Mixture Model (GMM): we fit a GMM to our measured illuminant samples in [, ] color space, and then draw samples from the GMM to define illuminant candidates.
We use candidates ( grid) for uniform candidate selection. For GMM candidate selection, we fit two-dimensional Gaussian distributions and sample candidates.
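The uniform-sampling strategy described above can be sketched in a few lines. This is an illustrative version under stated assumptions: the two chromaticity coordinates are treated as a generic 2D space, and the function name, grid size and sample values are hypothetical rather than those used in our experiments.

```python
def uniform_grid_candidates(samples, n_per_axis):
    """samples: list of 2D chromaticity tuples.

    Returns n_per_axis ** 2 candidates on a regular grid spanning the
    per-dimension minimum and maximum of the measured illuminants.
    """
    lo = [min(s[d] for s in samples) for d in range(2)]
    hi = [max(s[d] for s in samples) for d in range(2)]

    def axis(d):
        if n_per_axis == 1:
            return [(lo[d] + hi[d]) / 2]
        step = (hi[d] - lo[d]) / (n_per_axis - 1)
        return [lo[d] + i * step for i in range(n_per_axis)]

    return [(u, v) for u in axis(0) for v in axis(1)]

# Hypothetical measured illuminants in a 2D chromaticity space.
measured = [(0.4, 0.6), (0.9, 0.3), (0.6, 0.8), (0.5, 0.5)]
grid = uniform_grid_candidates(measured, n_per_axis=4)
```

Because the grid covers the full bounding box of the measurements, some of its points fall in regions where no illuminant was ever observed, which is the source of the unlikely candidates discussed below.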
In Table 10 we report inference performance on the Cube challenge data set using the described candidate selection strategies. We observe that simple uniform-sampling candidate selection performs reasonably well. The strategy provides an extremely simple implementation yet, by definition, will also sample some portion of very unlikely candidates. We note, however, that if the interpolation between candidates spans the illuminant space, our method can learn to interpolate these candidates appropriately, accounting for this. The GMM approach also results in slightly weaker accuracy cf. K-means, motivating our choice of sampling strategy in the experimental work for the main paper.
Appendix E Inference run-time
We report inference run-time results for the Gehler-Shi dataset [47, 23] in Table 11. Our real-time inference speed is obtained using an NVIDIA Tesla V100 card and an unoptimised implementation (PyTorch 1.0). We highlight that our algorithm is highly parallelizable, since each illuminant candidate likelihood can be computed independently; however, we obtain the run-time with a single-thread implementation. Our input image resolution is and timing results are recorded using K-means candidate selection with . The timing performance of other methods is obtained from their respective citations. We acknowledge that these timing comparisons are non-rigorous, as reported run-times are measured using differing hardware. To provide a fairer comparison, Table 12 reports run-times for both our method and the official FFCC implementation (https://github.com/google/ffcc) run on Matlab R2019b, under common hardware (Intel Core i9-9900X, 3.50GHz).
Appendix F Failure cases
In Figures F.4, F.7 and F.10 we illustrate observed limitations and failure cases. Our method learns to interpolate between candidate illuminants that are observed during training, but not to extrapolate to new illuminants. In panel (c), the ground truth illuminant (green filled circle) is clearly out of distribution, with no similar candidate illuminants observed during training. The inference accuracy in panel (a) suffers as a result.
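The interpolation limitation can be made concrete with a small sketch. It assumes that the final estimate is a convex combination of the candidates, with weights given by a softmax over per-candidate scores; the function names, scores and candidate values are hypothetical. Under that assumption, every channel of the estimate is bounded by the smallest and largest candidate value in that channel, so an illuminant outside the candidate set's range cannot be produced.

```python
import math

def softmax(scores):
    """Numerically stable softmax; the outputs are positive and sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine(candidates, scores):
    """Weighted sum of candidate illuminants using softmax weights."""
    weights = softmax(scores)
    dim = len(candidates[0])
    return tuple(
        sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)
    )

# Hypothetical 2D candidates and scores.
candidates = [(0.6, 0.4), (0.8, 0.5), (0.7, 0.9)]
estimate = combine(candidates, scores=[2.0, -1.0, 0.5])

# Per-channel bounds implied by the candidate set.
bounds = [
    (min(c[d] for c in candidates), max(c[d] for c in candidates))
    for d in range(2)
]
```

Whatever the scores are, `estimate` stays inside `bounds`, which is why a ground-truth illuminant far from all candidates leads to a large error.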
Further, our single global illuminant assumption can be seen to be violated in Figure F.7. The predicted illuminant attempts to balance the outer boundary portions of the wall painting as achromatic, clearly illuminated from above (out of shot). The measured ground truth illuminant captures the desk lamp illumination, resulting in high angular error for this image due to the global assumption.
Finally, in Figure F.10, we observe an example scene with extreme ambiguities. Our method appears to infer that the stone building in the scene background is achromatic, producing a highly plausible image. Yet the measured ground-truth illuminant shows the true building color to be a mild beige-yellow.
Appendix G Additional qualitative results
In Figure G.21, we provide additional qualitative results in the form of test images from the NUS dataset (Sony camera). For each test sample we show the input image, the image white-balanced using the ground-truth illumination, and the outputs of our model (“multi-device training + pretraining”) and of FFCC (model Q). Each row consists of: (a) the input image (b) FFCC (c) our prediction (d) ground truth.
Following prior work, we adopt the strategy of sorting test images by the combined mean angular error of the two evaluated methods. We present images of increasing average difficulty, sampled with a uniform spacing. Images are corrected by the inferred illuminants, followed by an estimated CCM (Color Correction Matrix) and standard sRGB gamma correction. The Macbeth Color Checker is used to generate the ground truth and is present in the images; however, the relevant regions are masked during both training and inference. In almost all sampled cases in Figure G.21, our approach gives consistently improved results.
We provide further extremely challenging examples in Figure G.42. We explicitly select the five images with the largest combined mean angular error. We observe that our method shows consistently strong performance, and highlight that these samples constitute cases of both ambiguous and multi-illuminant scenes, breaking the fundamental global illuminant assumption (made by both methods).
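The rendering steps used for these visualisations can be sketched per pixel as below. This is a simplified illustration, not our rendering code: the diagonal (von Kries-style) gains normalised to the green channel, the identity CCM placeholder and the function names are assumptions, while the gamma curve is the standard piecewise sRGB transfer function.

```python
def srgb_gamma(x):
    """Standard piecewise sRGB transfer function, applied after clipping to [0, 1]."""
    x = min(max(x, 0.0), 1.0)
    return 12.92 * x if x <= 0.0031308 else 1.055 * x ** (1 / 2.4) - 0.055

def correct_pixel(rgb, illuminant, ccm):
    """White-balance a linear RGB pixel, apply a 3x3 CCM, then sRGB gamma."""
    # Diagonal correction; gains are normalised so that the green gain is 1.
    gains = [illuminant[1] / c for c in illuminant]
    balanced = [v * g for v, g in zip(rgb, gains)]
    linear = [sum(ccm[r][c] * balanced[c] for c in range(3)) for r in range(3)]
    return tuple(srgb_gamma(v) for v in linear)

# Placeholder CCM (assumption): a real pipeline would use a per-camera matrix.
IDENTITY_CCM = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A grey surface seen under this illuminant has the illuminant's own colour.
illuminant = (0.8, 1.0, 0.6)
grey_pixel = (0.4, 0.5, 0.3)
corrected = correct_pixel(grey_pixel, illuminant, IDENTITY_CCM)
```

With a correct illuminant estimate the grey surface is rendered with equal channel values; an erroneous estimate leaves a visible color cast, which is what the qualitative comparisons show.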
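The image selection used for Figures G.21 and G.42 can be sketched as follows. The helper names and the toy error values are assumptions for illustration; the angular error is the usual angle between the estimated and ground-truth illuminant vectors.

```python
import math

def angular_error_deg(estimate, ground_truth):
    """Angle in degrees between two illuminant vectors."""
    dot = sum(a * b for a, b in zip(estimate, ground_truth))
    norm = math.sqrt(sum(a * a for a in estimate)) * math.sqrt(
        sum(b * b for b in ground_truth)
    )
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))

def rank_by_combined_error(errors_a, errors_b):
    """Image indices sorted by increasing mean error of the two methods."""
    combined = [(a + b) / 2 for a, b in zip(errors_a, errors_b)]
    return sorted(range(len(combined)), key=lambda i: combined[i])

def uniformly_spaced(order, n):
    """Pick n entries from an ordered list at (approximately) uniform spacing."""
    if n == 1:
        return [order[0]]
    return [order[(i * (len(order) - 1)) // (n - 1)] for i in range(n)]

# Hypothetical per-image angular errors (degrees) for two methods.
errors_ours = [1.0, 5.0, 3.0, 9.0, 2.0, 7.0]
errors_other = [2.0, 4.0, 2.0, 11.0, 1.0, 6.0]

order = rank_by_combined_error(errors_ours, errors_other)
shown = uniformly_spaced(order, 3)  # images of increasing difficulty
hardest = order[-2:][::-1]          # largest combined error first
```

Ranking on the combined error, rather than on either method alone, avoids choosing examples that favour one of the two methods.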