Meta-Learning for Few-shot Camera-Adaptive Color Constancy

11/28/2018 · by Steven McDonagh et al. · HUAWEI Technologies Co., Ltd.

Digital camera pipelines employ color constancy methods to estimate an unknown scene illuminant, enabling the generation of canonical images under an achromatic light source. By taking advantage of large amounts of labelled images, learning-based color constancy methods provide state-of-the-art estimation accuracy. However, for a new sensor, data collection is typically arduous, as it requires both imaging physical calibration objects across different settings (such as indoor and outdoor scenes), as well as manual image annotation to produce ground truth labels. In this work, we address sensor generalisation by framing color constancy as a meta-learning problem. Using an unsupervised strategy driven by color temperature grouping, we define many related, yet distinct, illuminant estimation tasks, aggregating data from four public datasets with different camera sensors and diverse scene content. Experimental results demonstrate it is possible to produce a few-shot color constancy method competitive with the fully-supervised, camera-specific state-of-the-art.




1 Introduction

An essential component of digital photography is color constancy, which accounts for the effect of scene illumination to enable natural image appearance. Accurate recording of intrinsic scene color is of great importance for many practical high-level computer vision applications including image classification, semantic segmentation and machine vision quality control [43, 56, 24]. Such applications commonly require that input images are device-independent and color-unbiased. However, captured image colors are always affected by the prevailing light source color in a scene. Extraction of the intrinsic color information from scene surfaces by compensating for scene illuminant color is commonly referred to as "color constancy" (CC) or "automatic white balance" (AWB).

The process of computational CC can be defined as the transformation of a source image, captured under an unknown illuminant, to a target image representing the same scene under a canonical illuminant (white light source). CC algorithms typically consist of two stages: first, estimation of the scene illuminant color and, second, transformation of the source image, accounting for the illuminant, such that the resulting image illumination appears achromatic.

(a) Pre-white balanced images, captured by a single camera (Canon1D; NUS-9 dataset [16]), marking their respective ground-truth gain corrections in space. Image frame border colors indicate corresponding image temperature bins.
(b) Ground-truth gain corrections for images observing identical scenes under similar illumination yet captured with distinct cameras (Sony, Canon1D; NUS-9 dataset [16]).
Figure 1: Our meta-task definition is conditioned on both image temperature and camera, motivated by the expected image light source separability and CSS distribution shifts.

The color of a surface is determined by both surface reflectance properties and by the spectral power distribution of the light(s) illuminating it. Variations in scene illumination therefore change the color of the surface appearance in an image. This combination of properties makes the problem underdetermined. Explicitly, the three physical factors, consisting of intrinsic surface reflectance, illuminant light color and camera spectral sensitivity (CSS), are collectively unknown and should be estimated. However, in practice we only observe a product of these factors, as measured in the digital image. More formally, we model a tri-chromatic photosensor response in the standard way such that:

$$\rho_c(x) = \int_\Omega E(\lambda)\, S(\lambda, x)\, R_c(\lambda)\, d\lambda, \qquad c \in \{R, G, B\} \tag{1}$$

where $\rho_c(x)$ is the intensity of color channel $c$ at pixel location $x$, $\lambda$ the wavelength of light such that $E(\lambda)$ represents the spectrum of the illuminant, $S(\lambda, x)$ the surface reflectance at pixel location $x$ and $R_c(\lambda)$ the CSS for channel $c$, considered over the spectrum of visible wavelengths $\Omega$. The goal of computational CC then becomes estimation of the global illumination color $e = (e_R, e_G, e_B)^\top$ where:

$$e_c = \int_\Omega E(\lambda)\, R_c(\lambda)\, d\lambda \tag{2}$$

Finding $e_c$ for each channel $c$ in Eq. (2) is ill-posed due to the infinitely many combinations of illuminant color and surface reflectance that result in the same image value.

Work on single-image illuminant color estimation can broadly be divided into statistics-based and learning-based methods [31]. Classical methods utilise low-level statistics to realise various instances of the gray-world assumption: the average reflectance in a scene under a neutral light source is achromatic. Gray-World [15] and its extensions [20, 54] are based on assumptions that tie scene reflectance statistics (e.g. mean or max reflectance) to the achromaticity of scene color. Related assumptions define perfect reflectance [37, 25] and result in White-Patch methods. Statistical methods are fast and typically contain few free parameters; however, performance is highly dependent on strong scene content assumptions, and these methods falter where those assumptions fail to hold.

Learning-based methods comprise combinational and direct approaches. The former apply optimal combinations of statistical methods to the input image, based on the observed scene [38]. Such approaches work better than unitary methods in general; however, result quality depends on the considered unitary methods whose outputs they combine. The latter group of learning-based approaches learns output directly from training data. Previous work has made use of traditional machine learning methods [26, 45, 28], relying on hand-crafted image features. Recent learning-based work provides both high accuracy and speed now approaching that of statistical methods [17, 10].

Notable recent deep-learning based color constancy approaches are supervised in nature and typically require large amounts of calibrated and hand-labelled sensor-specific data to learn robust models for each target device [4]. Collection and calibration of imagery for supervised training approaches to the color constancy problem is expensive and restrictive, commonly requiring placement of a physical calibration object in the scene where images are to be captured and, subsequently, accurate segmentation of the object in image space to extract ground-truth illuminant information. This combination of assiduous image acquisition, a large amount of manual work, and accurate yet data-hungry deep-learning approaches is the main barrier preventing fast, efficient and cheap supervised training of models capable of providing highly accurate and robust illuminant estimation for new device sensors.

In this paper we propose a new approach to mitigate the costly acquisition of large amounts of labelled, sensor-specific CC data. Inspired by recent progress in meta-learning, our proposed approach benefits from the accuracy of modern deep learning methodology without the typical associated data acquisition costs. We apply a few-shot learning strategy to the CC problem to learn camera-specific color biases from only a few target-device labelled images. We rely on the meta-learning approach of [21], which frames few-shot learning at two levels: a quick acquisition of knowledge within specific tasks, guided by a second, slower process that extracts information learned across a distribution of tasks.

Contributions. We address the computational CC problem with a meta-learning strategy, using only a few training images and allowing fast adaptation to new unseen data and camera sensors. We leverage the information contained in multiple datasets by defining camera- and illuminant-color-specific tasks through an unsupervised strategy driven by image color temperature. Taking the MAML algorithm [21] as our base meta-learner, we train deep convolutional neural networks to regress illuminant corrections and quickly adapt to our large variety of tasks. Our experiments on four public datasets, consisting of images captured by multiple distinct cameras, show that using only simple network architectures and only a handful of training images we obtain results competitive with the fully-supervised state-of-the-art. We provide a comparative analysis of the influence of key parameters and meta-learning methodology on performance, and show that our Meta-AWB approach is able to accurately estimate the illuminant even when camera sensitivity functions and camera balance are unknown a priori.

2 Related Work

Auto White Balance. Recent convolutional AWB work has considered both local patch-based [11, 50, 12, 33] and full image [9, 40, 10] input. The work of [11, 12] constitutes the first attempt to adopt a patch-based CNN for illuminant estimation. Shi et al. [50] propose an architecture involving interacting sub-networks, capable of generating multiple illuminant hypotheses to account for estimation ambiguities. Further recent work [33] uses learned confidence weights to select image regions with high semantic value for patch illuminant estimation.

In contrast to patch-based work, Lou et al. [40] use entire images as input, proposing a method using global image context. The semantic value of local image regions becomes arguably difficult to attribute when using full images and, importantly, such fully-supervised approaches highlight the limits of small CC datasets with insufficient labelled data. Image augmentation (e.g. synthetic relighting [12]) and transfer learning [11] using models pre-trained for alternative tasks (e.g. ImageNet [36]) have been employed to mitigate the small-data issue. The former strategy commonly struggles to generalise to real-world image manifolds at inference time. The latter offers the appealing properties of faster convergence and prevention of overfitting to a small dataset; however, the misalignment between object classification and computational CC likely results in learning features invariant to appearance attributes of critical importance for CC. Barron [9] alternatively frames computational CC as a 2D spatial localisation problem, representing image data using log-chroma histograms for which a single convolutional layer learns to evaluate illuminant color solutions in the chroma plane. The work is extended in [10] by performing the localisation task in the frequency domain, improving both speed and accuracy.

Inter-camera and unsupervised approaches. Few color constancy works have attempted to mitigate the costs of sensor-specific data collection, calibration and image labelling. Early unsupervised work [53] introduces a linear statistical model that is learned by observing how image colors vary jointly under common lighting changes. The model is learned for a single sensor using a large number of video frames, captured over an extended period. More recently, the work of [27] learns a transform matrix between distinct pairs of device CSS that is used to transform images and illuminant ground-truth, exhibiting the color biases of the first sensor to that of the second. Using this transform, models can be trained for the second sensor without acquiring an additional sensor-specific image dataset. An unsupervised technique is proposed in [6] using classical statistical approaches to learn parameters that approximate the unknown ground-truth illumination of the training images, avoiding calibration and image labelling.

Meta-Learning. The concept of meta-learning [47] can be thought of as an ability to comprehend and adapt one's own learning process and strategy on a level above that of acquiring task-specific knowledge. Systems with this capability of learning-to-learn [52] can evaluate, adjust and improve their learning strategy according to both experience gained previously and the requirements of the task at hand. In the context of machine learning, [14] defines meta-learning as the process of (1) acquiring task-specific knowledge during base-learning (a single application of a learning system) and (2) accumulating experience acquired during multiple previous applications of the base-learning system. Contemporary meta-learning covers a relatively broad spectrum of work and considers approaches enabling optimizer learning [2], hyperparameter optimisation [18], architecture search [57] and few-shot learning, among others. A recent survey is found in [41]. Few-shot learning problems, in particular, consist of learning a new task or concept using only a handful of data points (typically only a few samples per task) and have recently received considerable attention [55, 44, 51, 21]. This class of meta-learning promises a number of advantages with regard to efficient model building for new tasks: reducing the need for data acquisition and labelling by order(s) of magnitude, and decreasing effort spent on fine-tuning and adapting existing models to novel problems. A popular strategy, applicable to few-shot learning, consists of finding model initialisations that allow fast adaptation to new, previously unseen tasks. Finn et al. propose this idea and introduce Model-Agnostic Meta-Learning (MAML) [21] for few-shot classification and regression tasks. The strategy has since been adopted for tasks ranging from classification [35] to imitation learning [22], and several recently proposed extensions report increases in efficiency [42] and performance [39, 3].

3 Method

An overview of our proposed Meta-AWB method is shown in Fig. 2. In Section 3.1 we describe our simple deep model for scene illuminant regression, in Section 3.2 we introduce our task definition approach and meta-learning algorithms considered for few-shot inter-camera learning, and in Section 3.3 we describe our data and preprocessing.

Figure 2: Overview of the proposed strategy defining the task distribution. Considering a set of cameras and camera-specific images, we separate images into subtasks based on illuminant color. This is done by computing a color temperature for each image and building a CCT histogram for each camera. Images captured using the same camera and belonging to the same CCT histogram bin are assigned to the same task.

3.1 Scene illuminant regression

Let us consider an RGB image $x$ that has been captured with camera $c$ under a light source of unknown color. Our objective is to estimate a global illuminant correction vector $\hat{y}$ such that the corrected image appears identical to a canonical image (i.e. an image captured under a white light source). While a scene may contain multiple illuminants, here we follow the standard simplifying assumption and seek a single global illuminant correction per image. Similar to contemporary learning-based approaches, we adopt a deep learning strategy. We cast illuminant estimation as a regression task $\hat{y} = f(x)$, where $f$ is a nonlinear function described by a neural network model. The model's parameters are learned and optimised by minimising the well-known angular error loss:

$$L(\hat{y}, y) = \arccos\left(\frac{\hat{y} \cdot y}{\lVert \hat{y} \rVert\, \lVert y \rVert}\right) \tag{3}$$

Angular error provides a standard metric sensitive to the inferred orientation of the vector yet agnostic to its magnitude, providing independence from the brightness of the illuminant while comparing its color.
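As a concrete illustration, the angular error metric can be sketched in NumPy as follows. This is a minimal stand-alone version for clarity, not the authors' implementation; the function name is ours:

```python
import numpy as np

def angular_error(e_pred, e_true):
    """Angle, in degrees, between predicted and ground-truth illuminant vectors.

    Scale-invariant: multiplying either vector by a positive constant leaves
    the error unchanged, so illuminant brightness is ignored and only the
    color (direction) is compared.
    """
    e_pred = np.asarray(e_pred, dtype=np.float64)
    e_true = np.asarray(e_true, dtype=np.float64)
    cos = np.dot(e_pred, e_true) / (np.linalg.norm(e_pred) * np.linalg.norm(e_true))
    cos = np.clip(cos, -1.0, 1.0)  # guard against floating-point overshoot
    return np.degrees(np.arccos(cos))
```

For example, `angular_error([1, 1, 1], [2, 2, 2])` is zero, since the two illuminants differ only in brightness.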

Due to the discussed limits on typical dataset size per camera, we adopt a simple architecture comprising four convolutional layers, an average pooling layer and two fully connected layers, all with ReLU activations except for the last layer.

Given a dataset of images acquired using a single camera under varying illumination conditions, one can learn to regress global illuminants. Nonetheless, while there are many publicly available datasets, most comprise a relatively small number of images and are camera-specific. This limits the performance of deep learning techniques and typically necessitates aggressive data augmentation and/or pre-training on only quasi-related tasks. In addition, device-specific Camera Spectral Sensitivities (CSS) affect the color domain of captured images and the recording of scene illumination. Images captured by different cameras can therefore exhibit ground-truth illuminant distributions that occupy differing regions of the chromaticity space [27], as can be observed in Fig. 1(b). Intuitively, this means that two images of the same scene and illuminant will have different illuminant corrections if taken by different cameras. As a result, model performance can be limited when training or testing on images acquired from multiple cameras.

3.2 Few-shot inter-camera learning

We aim to leverage multiple datasets to train a model that is robust to CSS. To this end, we borrow ideas from meta-learning and few-shot learning classification approaches. We employ the MAML [21] algorithm for inter-camera illuminant regression. MAML aims to learn an optimal model initialisation from a large set of tasks so that it can adapt quickly at test time to a new unseen task.

Color temperature based task definition. We now consider a set of datasets, one per distinct image sensor. Our objective is to partition all images into a set of tasks such that (a) tasks are distinct and diverse, yet numerous enough to learn a good model initialisation, and (b) each task contains samples with a level of homogeneity that yields good performance when fine-tuning the model using only a few training images.

Defining tasks for few-shot classification is relatively straightforward, with a set of classes naturally defining a task set. In contrast, this paper considers a regression problem: estimating an illuminant correction vector. In this context, a naïve approach might be to define each camera as a task. However, the task count then increases only linearly as new cameras are added, and this would require a substantial number of individual sensors to provide enough training tasks. In addition, we expect to observe large variability in illuminant correction within one camera dataset, due to both scene and light source diversity. Achieving good performance and efficient generalisation using tasks with too much intra-task diversity may be difficult, especially when camera-specific models are fine-tuned in a few-shot setting. We therefore associate each camera with a set of tasks in which the illuminant corrections are clustered.

Gamut-based color constancy methods [23, 19, 7] assume that the color of the illuminant is constrained by the colors observed in the image. We make a similar hypothesis when defining our subtasks and aim to group images with similar dominant colors into the same task. Color temperature (CT) is a common measurement in photography, often used in high-end camera software to describe the color of the illuminant when setting white balance [34]. By definition, CT measures, in kelvin, the temperature to which a Planckian (or black body) radiator must be heated to produce light of a particular color. A Planckian radiator is a theoretical object that is a perfect radiator of visible light [48]. The Planckian locus, illustrated in Fig. 3, is the path that the color of a black body radiator traces in chromaticity space as the temperature increases, effectively illustrating all possible color temperatures.

Figure 3: (a) Chromaticity space with Planckian locus. (b) Color temperature chart and types of light source associated with specific temperatures.

In practice, the chromaticity of most light sources is off the Planckian locus, so the Correlated Color Temperature (CCT) is computed: the temperature of the point on the Planckian locus closest to the chromaticity of the non-Planckian light source [48, 46]. Intuitively, CCT describes the color of the light source and can be approximated from photos taken under this light. As shown in Fig. 3, different temperatures can be associated with different types of light [34].

For each image, we compute CCT using McCamy's approximation [32]:

$$\mathrm{CCT} = 449\,n^3 + 3525\,n^2 + 6823.3\,n + 5520.33, \qquad n = \frac{x - 0.3320}{0.1858 - y} \tag{4}$$

with the constants as defined in [32]. The variables $x, y$ are coordinates in the chromaticity space, which can easily be estimated from the image's RGB values [46]. Finally, for each camera $c$ we compute a histogram $H_c$ of CCT values and define each task as the set of images falling in one histogram bin: $T_{c,b} = \{I : \mathrm{cam}(I) = c,\ t_b \le \mathrm{CCT}(I) < t_{b+1}\}$, where $\mathrm{cam}(I)$ is the camera used to acquire image $I$, and $t_b$, $t_{b+1}$ are the edges of bin $b$ in histogram $H_c$. Intuitively, images within the same temperature bin will have a similar dominant color, and therefore one could expect them to have similar illuminant corrections. Figure 3 highlights that a large variety of light sources are defined by relatively low temperatures. Accounting for this non-uniform distribution, we define the bin edges as a partition of temperature values on a logarithmic scale. In particular, in the two-bin case we expect to separate images under a warm light source from images under a cold light source (e.g. indoor images vs. outdoor images). An overview of the task definition strategy is shown in Fig. 2, Task Definition box.
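To make the task-assignment pipeline concrete, the sketch below estimates a chromaticity, applies McCamy's CCT approximation [32], and assigns a log-spaced temperature bin. The sRGB-to-XYZ conversion and the bin range here are illustrative assumptions of ours (the paper operates on near-raw sensor data, and its exact bin edges are not reproduced):

```python
import numpy as np

# sRGB -> CIE XYZ (D65) matrix; an illustrative assumption, since the paper's
# pipeline works on near-raw images rather than sRGB.
RGB2XYZ = np.array([[0.4124, 0.3576, 0.1805],
                    [0.2126, 0.7152, 0.0722],
                    [0.0193, 0.1192, 0.9505]])

def mean_chromaticity(img_rgb):
    """Chromaticity (x, y) of the mean image color."""
    X, Y, Z = RGB2XYZ @ img_rgb.reshape(-1, 3).mean(axis=0)
    s = X + Y + Z
    return X / s, Y / s

def mccamy_cct(x, y):
    """McCamy's cubic approximation of correlated color temperature (kelvin)."""
    n = (x - 0.3320) / (0.1858 - y)
    return 449.0 * n**3 + 3525.0 * n**2 + 6823.3 * n + 5520.33

def assign_task(cct, n_bins=2, lo=2000.0, hi=12000.0):
    """Bin index on a log-spaced temperature partition, as in the task definition.

    `lo`, `hi` and `n_bins` are placeholder values, not the paper's settings.
    """
    edges = np.logspace(np.log10(lo), np.log10(hi), n_bins + 1)
    return int(np.clip(np.digitize(cct, edges) - 1, 0, n_bins - 1))
```

A flat white image maps to roughly the D65 white point (x ≈ 0.313, y ≈ 0.329), for which McCamy's formula returns a CCT near 6500 K, landing in the "cold" bin; a warm tungsten-like chromaticity lands in the other.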

MAML algorithm. We propose to use the Model Agnostic Meta Learning approach [21] to train our regression model to adapt quickly to multiple tasks and cameras. MAML is an iterative algorithm that trains a model on multiple tasks, aiming to find optimal initialisation parameters that allow good performance to be reached in only a few gradient updates, using only a small number of training samples for new unseen tasks.

Considering our set of tasks as defined above, each MAML iteration samples a batch of tasks $T_i$. For each task, $k$ meta-training images are randomly sampled and used to train the model $f$ with original parameters $\theta$ for a number of standard gradient descent updates. The model's parameters are updated to task-specific parameters $\theta_i'$:

$$\theta_i' = \theta - \alpha \nabla_\theta L_{T_i}(f_\theta) \tag{5}$$

where $\alpha$ is the learning rate parameter and $L_{T_i}$ is the regression loss function as described in Eq. (3). Finally, a new set of meta-test images is sampled from the same task $T_i$. For each task in our batch, we compute the meta-test loss using the task-specific updated parameters. We can finally update the global parameters as:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{T_i} L_{T_i}(f_{\theta_i'}) \tag{6}$$

where $\beta$ is the meta-learning rate parameter. We refer to the task-specific parameter update (Eq. 5) as the inner update, and to the global update (Eq. 6) as the outer or meta-update.

At test time, parameters $\theta'$ are optimised for a new unseen task using Eq. 5, applying a number of gradient updates on $k$ training samples. We finally compute the illuminant correction for each test image $x$ as $\hat{y} = f_{\theta'}(x)$.
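The inner update (Eq. 5) and meta-update (Eq. 6) can be sketched on a toy linear-regression meta-learner. This is a first-order simplification (in the spirit of FOMAML) that drops second derivatives to stay short; the paper's model is a CNN trained with the angular error loss, so everything below is illustrative, not the authors' procedure:

```python
import numpy as np

def mse_grad(theta, X, y):
    """Gradient of mean-squared error for a linear model f(x) = X @ theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def maml_step(theta, tasks, alpha=0.01, beta=0.001, inner_steps=1):
    """One meta-update. Each task is ((X_train, y_train), (X_test, y_test)).

    First-order approximation: the meta-gradient is evaluated at the adapted
    parameters, ignoring second derivatives; full MAML differentiates
    through the inner updates themselves.
    """
    meta_grad = np.zeros_like(theta)
    for (X_tr, y_tr), (X_te, y_te) in tasks:
        theta_i = theta.copy()
        for _ in range(inner_steps):                 # inner, task-specific updates (Eq. 5)
            theta_i -= alpha * mse_grad(theta_i, X_tr, y_tr)
        meta_grad += mse_grad(theta_i, X_te, y_te)   # gradient of the meta-test loss
    return theta - beta * meta_grad                  # outer meta-update (Eq. 6)
```

Iterating `maml_step` over a batch of related regression tasks drives the initialisation toward parameters from which each task is reachable in a few inner updates.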

We investigate the performance of three variants of MAML, characterised by different definitions of the learning rate $\alpha$. The first, as described in the original MAML algorithm [21], defines $\alpha$ as a manually set constant (MAML). Secondly, we consider the metaSGD and LSLR approaches [39, 3], which define $\alpha$ as a learnable parameter, allowing the direction and magnitude of the gradient descent to be learned. In metaSGD [39], we learn an $\alpha$ value per parameter in the network, meaning that $\alpha$ and $\theta$ have the same size, consequently doubling the number of trainable parameters. MetaSGD has been observed to yield improved results, but the large number of parameters has the potential to make training difficult with limited training data. Finally, we consider LSLR [3], which also learns $\alpha$ but substantially reduces the number of trainable parameters (cf. metaSGD), as LSLR learns a single $\alpha$ per network layer for each inner gradient update.

3.3 Datasets and preprocessing

Four public color constancy datasets, Gehler-Shi [49, 28], NUS-9 [16], Cube [6] and Funt HDR [25], are combined to investigate the capabilities of the proposed methodology, yielding images captured by a variety of different cameras. For each dataset, we make use of the provided 'almost-raw' PNG images for all experimental work that follows in Section 4. Ground-truth illumination is measured by a Macbeth Color Checker (MCC) in all datasets except the 'Cube' database, which alternatively uses a SpyderCube [1] calibration object. We make the common assumption that each image contains a dominant global scene illuminant, and use the provided ground-truth measurement for Gehler-Shi, NUS-9 and Cube.

Gehler-Shi: The dataset [49, 28] contains images of indoor and outdoor scenes. Images were captured using Canon 1D and Canon 5D cameras.

NUS-9: The NUS 9-Camera dataset [16] consists of 9 subsets of 210 images per camera. (The NUS dataset has recently been updated to include 117 additional images from a ninth camera; during training we use all nine cameras.) All subsets comprise images representing the same scenes, highlighting the influence of the camera sensor.

Cube: The 'Cube' dataset [6] is a recently provided CC resource containing exclusively outdoor imagery, captured using a Canon EOS 550D camera.

Funt HDR: The HDR image dataset captured by Funt and Shi [25] contains images captured using a Nikon D700. We make use of the provided high-bit-depth PNG versions of the raw imagery. The dataset contains, for each scene, a subset of several images captured over short time frames (5 fps). Most subsets contain images where the MCC is absent. We choose a single representative image for each scene (fixing exposure, shutter speed and aperture) and discard scenes in which all images contain the MCC. The Funt HDR dataset thus contributes distinct image samples to our accumulated total, from which we define the meta-task distribution. Multiple MCCs are present in each scene and, following our global-illuminant assumption, we approximate a dominant illuminant using the median measured illumination.

Data preprocessing: Camera-specific black-level corrections are applied in keeping with the offsets specified in the dataset descriptions, and we normalize network input to [0, 1], accounting for the appropriate image bit depths. Gehler-Shi, NUS-9 and Funt HDR images are masked using the provided MCC coordinates, and Cube images using the fixed SpyderCube image location. Illuminant ground-truth information is thus masked in all images containing a calibration object, during both learning and inference stages.
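A minimal preprocessing helper in this spirit might look like the following; the black-level offset, bit depth and mask coordinates are placeholders of ours, since each dataset specifies its own:

```python
import numpy as np

def preprocess(raw, black_level, bit_depth, mask_box=None):
    """Black-level correction and normalisation to [0, 1].

    `mask_box` = (r0, r1, c0, c1) optionally zeroes out the region holding the
    calibration object (e.g. the Macbeth chart), so the network never sees
    the ground-truth illuminant. All parameter values are illustrative.
    """
    img = np.clip(raw.astype(np.float64) - black_level, 0, None)
    img /= (2 ** bit_depth - 1) - black_level  # map remaining range onto [0, 1]
    if mask_box is not None:
        r0, r1, c0, c1 = mask_box
        img[r0:r1, c0:c1] = 0.0
    return img
```

For instance, a saturated 12-bit pixel with a black level of 129 maps exactly to 1.0, while the masked calibration region is zeroed.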

4 Results

Implementation. We evaluate our model using leave-one-out cross-validation over the cameras, i.e. for each camera we train a model using random image crops from the remaining cameras. Due to the makeup of our proof-of-concept amalgamated dataset, the majority of cameras capture similar scene content (NUS-9 imagery) and, since we aim to optimise the ability to learn a generalisable solution between datasets without overfitting to particular scenes, we choose the Gehler-Shi (Canon 5D) images as a validation set. This allows for optimisation of model hyperparameters using imagery containing unique scene content. We train our CNN model with a fixed number of iterations, meta-batch size, number of training images $k$, and meta-learning rate $\beta$. The inner-update learning rate $\alpha$ is set to a constant for MAML, and randomly initialised for metaSGD and LSLR. We use batch normalisation on all convolutional layers. At inference time, for each test image, we randomly select $k$ training samples (from the test image's task) and fine-tune the model for a small number of iterations. To evaluate the statistical robustness of our method to variation in the selection of these images, we independently repeat the draws for each test image. We report the median angular error over all images, averaged over all draws.
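The draw-and-average evaluation protocol can be sketched as below. Here `model_errors_fn` stands in for the fine-tune-then-evaluate step (adapt on the k support samples, return per-image angular errors) and is a hypothetical interface of ours, not the paper's code:

```python
import numpy as np

def evaluate(model_errors_fn, test_images, task_pool, k=10, n_draws=10, rng=None):
    """Median angular error over all test images, averaged over `n_draws`
    independent random selections of the k fine-tuning samples.

    `model_errors_fn(support, images)` is a hypothetical callable that
    fine-tunes on the k support samples and returns per-image errors.
    """
    rng = rng or np.random.default_rng(0)
    medians = []
    for _ in range(n_draws):
        # draw k support samples from the test image's task, without replacement
        idx = rng.choice(len(task_pool), size=k, replace=False)
        errors = model_errors_fn([task_pool[i] for i in idx], test_images)
        medians.append(np.median(errors))
    return float(np.mean(medians))
```

Averaging the per-draw medians, rather than reporting a single draw, exposes the sensitivity of few-shot adaptation to which k samples happen to be selected.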

Task definition. To explore the validity of our meta-task definition, we plot CCT histograms and the respective ground-truth gain corrections per image, relative to CCT bin assignment, in RGB space. Figure 4 provides an example of these for the NUS Canon 600D image set. Correlating bin edges to the temperature chart found in Figure 3, we see that images can be separated based on light type and, more specifically, indoor vs. outdoor light sources. This observation is confirmed by the CCT histogram obtained from the Cube dataset (Fig. 4(a)), where all images contain outdoor scenes and fall essentially into one bin. Figure 4 also illustrates that our CCT histograms generate homogeneous tasks, since ground-truth illuminant corrections belonging to the same bin are well clustered in RGB space. Images now belong to a task conditioned on both camera sensor and CCT bin. This results in valid learning tasks, assuming that each CCT bin is non-empty for each sensor. While ideally we would strive for a larger number of learning tasks, to increase the granularity with which we can separate lighting conditions, we were limited by the small image datasets: a finer partition yielded bins with insufficient image counts per task. We keep the number of histogram bins fixed in the remaining experiments.

(a) Cube
(b) Canon600D (NUS)
(c) Canon 600D (NUS) ground-truth illuminants in RGB space
Figure 4: Task definition: Temperature histograms and corresponding separation in RGB space.
(a) Gehler-Shi (Canon 5D)
(b) Gehler-Shi (entire dataset)
(c) Cube
(e) NUS-9
(f) NUS-9, one random draw
Figure 5: Median angular-error with respect to the number of fine-tuning updates. (a) Parameter study results, (b-e) dataset-specific results compared to the baselines, (f) per-camera angular-error after 10 updates, all NUS-9 cameras. Error bars in (a-e) report inter-draw variance.

Parameter and method analysis. Using our validation camera we evaluate the influence of key parameters and the meta-learning strategy. We train models using imagery from all remaining cameras for our three considered variants: MAML, metaSGD and LSLR. For each meta-learning strategy, we evaluate the influence of the number of inner gradient updates, training models with 1 and 5 updates. While $k$ is fixed during training, we evaluate the influence of the number of training images available at test time, computing performance for $k$ = 5, 10 and 20. We also report results for 1, 5 and 10 test-time updates. Since LSLR learns a different learning rate per inner update, we reuse the final learned rate when the number of test-time updates exceeds that used during training.

We report results in Table 1, with the best results for each configuration reported in bold. We observe a substantial improvement in performance when increasing the number of inner updates at training time. Of the meta-algorithm variants, metaSGD appears to have the worst performance. This can be attributed to the fact that the method requires a significantly larger number of parameters, manifesting as training difficulty given our limited amount of training data and task count. LSLR appears to offer a compromise between simplicity and flexibility, yielding the best results when more inner updates are used. Interestingly, however, LSLR performs poorly compared to MAML when training with only one gradient update.

Predictably, we observe an increase in performance as we increase the number of training images $k$, reaching our overall best median angular error with the largest number of training samples. Finally, we can see that LSLR and metaSGD benefit from additional gradient updates, while MAML appears to reach optimal performance with fewer. Considering these experimental observations, we use the LSLR variant with the better-performing training configuration for the remainder of our experimental work, and hold $k$ fixed unless otherwise specified.

Training parameters
MAML (1 update, 5 updates) · metaSGD (1 update, 5 updates) · LSLR (1 update, 5 updates)
Testing parameters
k = 5: 1 update / 5 updates / 10 updates
k = 10: 1 update / 5 updates / 10 updates
k = 20: 1 update / 5 updates / 10 updates
Table 1: Our meta-learning hyper-parameter investigation and method analysis (median angular error). Best results for each k-shot configuration are reported in bold.

Results comparisons. We train a simple baseline model using the introduced network architecture and standard back-propagation with leave-one-out cross-validation. At test time we report results both with (Baseline: fine-tune) and without (Baseline: no fine-tune) k-shot fine-tuning. Validation data splits and fine-tuning are performed as described in Section 4, Implementation. Baselines are trained for 60 iterations using the same parameters as our Meta-AWB model.

Figure 5 shows the evolution of the median angular error, with respect to the number of gradient updates, for all datasets under both baselines and our Meta-AWB approach. Figure 5(a) reports results from our parameter analysis, highlighting the performance and quick adaptation of the different meta-learning methods with respect to the baselines on the Canon 5D camera (Gehler-Shi). We observe that our approach adapts significantly faster than the baselines, with a clear gap in performance for most datasets. The Cube dataset shows the largest improvement, suggesting a large sensor difference to the remaining training data. Baseline fine-tuning adapts fastest on NUS-9, which can be attributed to the fact that we train on multiple instances of the same scenes. Nonetheless, Fig. 5(f) shows per-camera performance and highlights that while our approach yields similar performance across NUS-9 cameras, the fine-tuning baseline fails to adapt successfully to several cameras (e.g. Olympus, Nikon D40 and Fujifilm) and yields significantly worse performance. Our method provides better performance on each NUS-9 test camera individually, in addition to better average performance. Finally, performance on Funt HDR suggests that our single-illuminant approximation may not be optimal.

Finally, we compare our results with recent state-of-the-art approaches on the common benchmark datasets: Gehler-Shi, NUS-8 and Cube. Because we select only a subset of Funt HDR images, and approximate a single-illuminant ground truth, Funt HDR results are not directly comparable. We report results on NUS-8 (without the recently added Nikon D40 camera) to provide a fair and accurate comparison. Quantitative results are shown in Tables 2 (NUS-8), 3 (Gehler-Shi) and 4 (Cube), where we report standard angular-error statistics (Tri. is trimean and G.M. geometric mean). We obtain results competitive with fully-supervised, state-of-the-art methods, despite using significantly less camera-specific training data. We achieve good generalisation on all datasets, in particular on the Cube dataset, whose sensor and image content appear, from our baseline results, to be substantially different from the rest of our datasets. In addition, we obtain results on par with fully-supervised approaches on the NUS dataset and significantly outperform unsupervised approaches that rely on large amounts of data. This stronger performance (compared to the other considered datasets) can be linked to the fact that the NUS scene content is repeatedly seen during training.
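For reference, the angular-error statistics reported in the tables below can be computed from per-image errors as in the following sketch; the angular error is the angle between estimated and ground-truth illuminant vectors, and the exact quartile conventions of prior work may differ slightly.

```python
import numpy as np

def angular_error_deg(est, gt):
    """Angle in degrees between estimated and ground-truth RGB illuminants."""
    est = np.asarray(est, float) / np.linalg.norm(est)
    gt = np.asarray(gt, float) / np.linalg.norm(gt)
    return np.degrees(np.arccos(np.clip(np.dot(est, gt), -1.0, 1.0)))

def error_statistics(errors):
    """Mean, median, trimean, best-25% / worst-25% means and geometric mean
    of a set of per-image angular errors (all in degrees)."""
    e = np.sort(np.asarray(errors, float))
    q1, med, q3 = np.percentile(e, [25, 50, 75])
    n = len(e)
    return {
        "mean": e.mean(),
        "median": med,
        "trimean": (q1 + 2 * med + q3) / 4.0,   # Tukey's trimean
        "best25": e[: max(1, n // 4)].mean(),   # mean of easiest quarter
        "worst25": e[-max(1, n // 4):].mean(),  # mean of hardest quarter
        "gmean": np.exp(np.log(np.maximum(e, 1e-12)).mean()),
    }
```

Note that the G.M. column of Table 2 aggregates across cameras (following [10]); the per-error geometric mean above is one plausible reading and is labelled as an assumption.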

Algorithm Mean Median Tri. Best 25% Worst 25% G.M.
Low-level statistics-based methods
White-Patch [13]
Gray-world [15]
Edge-based Gamut [30]
Natural Image Statistics [29]

Fully-supervised learning

Bayesian [28]
Cheng et al. 2014 [16]
Shi et al. 2016 [50]
CCC [9]
Cheng et al. 2015 [17]
FFCC [10]
SqueezeNet-FC4 [33]
{Unsupervised, Meta} learning
Color Tiger [6]
Table 2: Performance on the NUS-8 dataset [16]. We follow the same format as [10], reporting average performance (geometric mean) over the original NUS cameras.
Algorithm Mean Median Tri. Best 25% Worst 25% G.M.
Low-level statistics-based methods
White-Patch [13]
Gray-world [15]
Edge-based Gamut [30]
Fully-supervised learning
Bayesian [28]
Cheng et al. 2014 [16]
Bianco CNN [11]
Cheng et al. 2015 [17]
CCC [9]
DS-Net [50]
SqueezeNet-FC4 [33]
FFCC [10]
Table 3: Performance on the Gehler-Shi dataset [49, 28]. Previous methods as reported by [10].
Algorithm Mean Median Tri. Best 25% Worst 25% G.M.
Low-level statistics-based methods
White-Patch [13]
Gray-World [15]
General Gray-World [8]
Smart Color Cat [5]
{Unsupervised, Meta} learning
Color Tiger [6]
Restricted Color Tiger [6]
Table 4: Performance on the Cube dataset. Previous methods as reported by [6].

5 Conclusion

We present Meta-AWB, a meta-learning approach to computational color constancy. Our few-shot learning technique enables fast generalisation across four benchmark datasets and different camera sensors, resulting in accuracy competitive with the camera-specific, fully-supervised state-of-the-art trained on an order of magnitude more test-camera data. We introduce a novel meta-task definition strategy, driven by image color temperature, that results in distinct regression tasks with intuitive physical meaning. We present and study the influence of several variants of our technique, exploring the number of shots, the number of gradient updates and the meta-learning outer update strategy. We show improved learning ability over standard fine-tuning, resulting in efficient use of only a few training samples. Meta-AWB generalises quickly and learns to solve the computational color constancy problem in a camera-agnostic fashion. This offers the potential for high accuracy as new sensors become available, while mitigating the arduous and time-consuming calibration of training imagery required by fully-supervised approaches. The current work relies on a relatively limited number of tasks for meta-training, which likely yields a bias benefiting cameras observing similar scene content (NUS data). One would expect better generalisation performance with more imaging-content variability per camera. Future work will also investigate how accuracy can be further improved by more complex base-learner components.


  • [1] Datacolor SpyderCube. Accessed: 2018-11-05.
  • [2] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
  • [3] A. Antoniou, H. Edwards, and A. Storkey. How to train your MAML. arXiv preprint arXiv:1810.09502, 2018.
  • [4] Ç. Aytekin, J. Nikkanen, and M. Gabbouj. A data set for camera-independent color constancy. IEEE Transactions on Image Processing, 27(2):530–544, 2018.
  • [5] N. Banić and S. Lončarić. Using the red chromaticity for illumination estimation. In Image and Signal Processing and Analysis (ISPA), 2015 9th International Symposium on, pages 131–136. IEEE, 2015.
  • [6] N. Banić and S. Lončarić. Unsupervised learning for color constancy. CoRR, abs/1712.00436, 2017.
  • [7] K. Barnard. Improvements to gamut mapping colour constancy algorithms. In European conference on computer vision, pages 390–403. Springer, 2000.
  • [8] K. Barnard, V. Cardei, and B. Funt. A comparison of computational color constancy algorithms. i: Methodology and experiments with synthesized data. IEEE transactions on Image Processing, 11(9):972–984, 2002.
  • [9] J. T. Barron. Convolutional color constancy. In Proceedings of the IEEE International Conference on Computer Vision, pages 379–387, 2015.
  • [10] J. T. Barron and Y.-T. Tsai. Fast Fourier color constancy. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [11] S. Bianco, C. Cusano, and R. Schettini. Color constancy using cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 81–89, 2015.
  • [12] S. Bianco, C. Cusano, and R. Schettini. Single and multiple illuminant estimation using convolutional neural networks. IEEE Transactions on Image Processing, 26(9):4347–4362, 2017.
  • [13] D. H. Brainard and B. A. Wandell. Analysis of the retinex theory of color vision. JOSA A, 3(10):1651–1661, 1986.
  • [14] P. Brazdil, C. G. Carrier, C. Soares, and R. Vilalta. Metalearning: Applications to data mining. Springer Science & Business Media, 2008.
  • [15] G. Buchsbaum. A spatial processor model for object colour perception. Journal of the Franklin institute, 310(1):1–26, 1980.
  • [16] D. Cheng, D. K. Prasad, and M. S. Brown. Illuminant estimation for color constancy: why spatial-domain methods work and the role of the color distribution. JOSA A, 31(5):1049–1058, 2014.
  • [17] D. Cheng, B. Price, S. Cohen, and M. S. Brown. Effective learning-based illuminant estimation using simple features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1000–1008, 2015.
  • [18] M. Feurer, J. T. Springenberg, and F. Hutter. Initializing bayesian hyperparameter optimization via meta-learning. In AAAI, pages 1128–1135, 2015.
  • [19] G. D. Finlayson. Color in perspective. IEEE transactions on Pattern analysis and Machine Intelligence, 18(10):1034–1038, 1996.
  • [20] G. D. Finlayson and E. Trezzi. Shades of gray and colour constancy. In Color and Imaging Conference, volume 2004, pages 37–41. Society for Imaging Science and Technology, 2004.
  • [21] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
  • [22] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017.
  • [23] D. A. Forsyth. A novel algorithm for color constancy. International Journal of Computer Vision, 5(1):5–35, 1990.
  • [24] B. Funt, K. Barnard, and L. Martin. Is machine colour constancy good enough? In European Conference on Computer Vision, pages 445–459. Springer, 1998.
  • [25] B. Funt and L. Shi. The rehabilitation of maxrgb. In Color and imaging conference, volume 2010, pages 256–259. Society for Imaging Science and Technology, 2010.
  • [26] B. Funt and W. Xiong. Estimating illumination chromaticity via support vector regression. In Color and Imaging Conference, volume 2004, pages 47–52. Society for Imaging Science and Technology, 2004.
  • [27] S.-B. Gao, M. Zhang, C.-Y. Li, and Y.-J. Li. Improving color constancy by discounting the variation of camera spectral sensitivity. JOSA A, 34(8):1448–1462, 2017.
  • [28] P. V. Gehler, C. Rother, A. Blake, T. Minka, and T. Sharp. Bayesian color constancy revisited. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
  • [29] A. Gijsenij and T. Gevers. Color constancy using natural image statistics and scene semantics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):687–698, 2011.
  • [30] A. Gijsenij, T. Gevers, and J. Van De Weijer. Generalized gamut mapping using image derivative structures for color constancy. International Journal of Computer Vision, 86(2-3):127–139, 2010.
  • [31] A. Gijsenij, T. Gevers, J. Van De Weijer, et al. Computational color constancy: Survey and experiments. IEEE Transactions on Image Processing, 20(9):2475–2489, 2011.
  • [32] J. Hernandez-Andres, R. L. Lee, and J. Romero. Calculating correlated color temperatures across the entire gamut of daylight and skylight chromaticities. Applied optics, 38(27):5703–5709, 1999.
  • [33] Y. Hu, B. Wang, and S. Lin. Fc4: Fully convolutional color constancy with confidence-weighted pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17), pages 4085–4094, 2017.
  • [34] R. Jacobson, S. Ray, G. G. Attridge, and N. Axford. Manual of Photography. Taylor & Francis, 2000.
  • [35] M. A. Jamal, G.-J. Qi, and M. Shah. Task-agnostic meta-learning for few-shot learning. arXiv preprint arXiv:1805.07722, 2018.
  • [36] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [37] E. H. Land and J. J. McCann. Lightness and retinex theory. Josa, 61(1):1–11, 1971.
  • [38] B. Li, W. Xiong, W. Hu, and B. Funt. Evaluating combinational illumination estimation methods on real-world images. IEEE Transactions on Image Processing, 23(3):1194–1209, 2014.
  • [39] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to learn quickly for few shot learning. arXiv preprint arXiv:1707.09835, 2017.
  • [40] Z. Lou, T. Gevers, N. Hu, M. P. Lucassen, et al. Color constancy by deep learning. In BMVC, pages 76–1, 2015.
  • [41] L. Metz, N. Maheswaranathan, B. Cheung, and J. Sohl-Dickstein. Learning unsupervised learning rules. arXiv preprint arXiv:1804.00222, 2018.
  • [42] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2018.
  • [43] R. Ramanath, W. E. Snyder, Y. Yoo, and M. S. Drew. Color image processing pipeline. IEEE Signal Processing Magazine, 22(1):34–43, 2005.
  • [44] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2017.
  • [45] C. Rosenberg, A. Ladsariya, and T. Minka. Bayesian color constancy with non-gaussian models. In Advances in neural information processing systems, pages 1595–1602, 2004.
  • [46] J. Schanda. Colorimetry: understanding the CIE system. John Wiley & Sons, 2007.
  • [47] J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
  • [48] E. F. Schubert. Light-emitting diodes. E. Fred Schubert, 2018.
  • [49] L. Shi and B. Funt. Re-processed version of the Gehler color constancy dataset of 568 images. http://www.cs.sfu.ca/~color/data/, 2000.
  • [50] W. Shi, C. C. Loy, and X. Tang. Deep specialized network for illuminant estimation. In European Conference on Computer Vision, pages 371–387. Springer, 2016.
  • [51] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
  • [52] S. Thrun and L. Pratt. Learning to learn: Introduction and overview. In Learning to learn, pages 3–17. Springer, 1998.
  • [53] K. Tieu and E. G. Miller. Unsupervised color constancy. In Advances in neural information processing systems, pages 1327–1334, 2003.
  • [54] J. Van De Weijer, T. Gevers, and A. Gijsenij. Edge-based color constancy. IEEE Transactions on image processing, 16(9):2207–2214, 2007.
  • [55] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
  • [56] M. Vrhel, E. Saber, and H. J. Trussell. Color image generation and display technologies. IEEE Signal Processing Magazine, 22(1):23–33, 2005.
  • [57] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2(6), 2017.

Appendix A Task definition using color temperature histograms

In our paper, a task is defined as a collection of images from a specific camera and a specific bin of the CCT histogram. We provide additional examples of CCT histograms and their corresponding clusterings of ground-truth (GT) illuminant corrections in RGB space. Results for the three-bin case, for two cameras (Canon 600D and Nikon D40 from the NUS-9 dataset [16]), are shown in Fig. 6. For the Canon 600D, Fig. 6(a) shows an acceptable number of images in each bin for defining a task, and Fig. 6(b) exhibits the related ground-truth illuminant clustering. For the Nikon D40, a critical issue is that, at this histogram granularity, there are not enough images available in the first bin, as shown in Fig. 6(c). In Fig. 6(d) it can be observed that the ground-truth illuminants in bins two and three retain reasonable cluster separability, suggesting that more image data would help make increased histogram granularity feasible. These results motivate the bin count chosen in the main paper, providing distinctive tasks with a sufficient number of images in each bin of the CCT histogram.
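The binning above can be sketched as follows. The paper references [32] for CCT computation; as an illustrative stand-in, this sketch converts a linear-RGB illuminant to CIE xy chromaticity and applies McCamy's standard cubic approximation, and uses equal-count bins rather than the paper's histogram bins — both are assumptions.

```python
import numpy as np

# Linear sRGB -> CIE XYZ (D65) matrix.
RGB2XYZ = np.array([[0.4124, 0.3576, 0.1805],
                    [0.2126, 0.7152, 0.0722],
                    [0.0193, 0.1192, 0.9505]])

def cct_mccamy(rgb):
    """Approximate correlated color temperature (Kelvin) of an RGB
    illuminant via xy chromaticity and McCamy's cubic formula."""
    X, Y, Z = RGB2XYZ @ np.asarray(rgb, float)
    x, y = X / (X + Y + Z), Y / (X + Y + Z)
    n = (x - 0.3320) / (0.1858 - y)
    return 449.0 * n**3 + 3525.0 * n**2 + 6823.3 * n + 5520.33

def assign_tasks(gt_illuminants, n_bins=3):
    """Group images into tasks by the CCT of their GT illuminant
    (equal-count bins here, for illustration)."""
    ccts = np.array([cct_mccamy(e) for e in gt_illuminants])
    edges = np.quantile(ccts, np.linspace(0, 1, n_bins + 1))
    labels = np.clip(np.digitize(ccts, edges[1:-1]), 0, n_bins - 1)
    return labels, ccts
```

As a sanity check, an achromatic illuminant maps to roughly 6500 K (D65), while a warm (reddish) illuminant maps to a lower CCT.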

(a) Canon 600D histogram
(b) Canon 600D GT illuminants in RGB space
(c) Nikon D40 histogram
(d) Nikon D40 GT illuminants in RGB space
Figure 6: CCT histograms , 3 bin example.

Appendix B Base-learner architecture details

Figure 7 shows the base-learner network architecture considered in our work. As described in the manuscript, the base architecture is composed of four convolutional layers, an average pooling layer and two fully connected layers, all with ReLU activations except for the last layer.

Figure 7: Illustration of the considered network architecture.
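The layer sizes did not survive extraction, so the following sketch encodes the architecture's shape as a specification with placeholder kernel sizes and channel widths (all numbers hypothetical) and counts its parameters; it illustrates why such a base learner is small enough for fast few-shot adaptation.

```python
# Hypothetical layer specification: kernel sizes and channel widths below are
# placeholders, not the manuscript's values.
CONV = [(3, 64, 3), (64, 64, 3), (64, 64, 3), (64, 64, 3)]  # (in_ch, out_ch, k)
FC = [(64, 64), (64, 3)]  # global average pool -> hidden vector -> RGB illuminant

def n_params(conv=CONV, fc=FC):
    """Parameter count (weights + biases) of the conv + FC base learner."""
    total = sum(ci * co * k * k + co for ci, co, k in conv)
    total += sum(i * o + o for i, o in fc)
    return total
```

Under these placeholder widths the base learner has on the order of 10^5 parameters, orders of magnitude smaller than common classification backbones.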

Appendix C Funt HDR data set

C.1 Image selection per scene

We make use of the HDR image dataset captured by Funt and Shi [25] and consider the scenes for which ground-truth illuminants are available. The dataset contains, per scene, a subset of images captured over a short time frame. During capture, the camera was able to auto-adjust shutter speed and aperture settings between frames [25]; in other words, the f-stop setting was not fixed. We choose a representative image per scene by selecting a single exposure, shutter-speed and aperture configuration: in practice, we take the scene image with aperture f-stop nearest to a reference value, in an attempt to balance reasonable depth of field (retaining sharp objects in frame) against the amount of light admitted.
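The selection rule above amounts to a nearest-f-stop query per scene; in the sketch below the frame records and the reference f-number `target_fstop` are hypothetical (the paper's exact value was lost in extraction).

```python
def pick_representative(frames, target_fstop=8.0):
    """Choose, per scene, the frame whose aperture f-number is closest to a
    reference value. `target_fstop` is an assumed placeholder."""
    return min(frames, key=lambda f: abs(f["fstop"] - target_fstop))

# Example scene: three frames auto-exposed at different apertures.
scene = [{"id": 1, "fstop": 4.0}, {"id": 2, "fstop": 8.0}, {"id": 3, "fstop": 16.0}]
chosen = pick_representative(scene)
```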

C.2 Macbeth Color Checker

In each scene, multiple Macbeth Color Checkers (MCCs) are present, positioned such that the incident illumination is expected to represent the overall scene illumination. To realise our (unique) dominant-scene-illuminant assumption, we compute the median GT illuminant per scene to approximate a dominant illumination. In Figure 8 we investigate the validity of this approach. For each scene, we compute the angular error between the median GT illuminant and each of the distinct GT scene illuminants provided by the individual MCCs. Figure 8(a) shows per-scene boxplots of the computed angular errors. While we observe that most median errors are small, several images exhibit very large angular differences between MCCs. This suggests that our single-illuminant approximation may not be suitable for these images, explaining our relatively poorer performance on the Funt HDR dataset.

In Figure 8(b), we show a histogram of the minimum angular distance between any scene MCC illuminant and the median GT illuminant (used in our paper as ground truth). We observe that the large majority of scenes have a distance smaller than a degree, suggesting that our approximation closely matches an approach that chooses a single MCC as GT under the single-illuminant assumption.
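The per-scene consistency check described above can be sketched directly: compute the channel-wise median of the MCC illuminants, then the angular error of each MCC against it (the channel-wise median is our reading of "median GT illuminant").

```python
import numpy as np

def angle_deg(u, v):
    """Angular error in degrees between two RGB illuminant vectors."""
    u = np.asarray(u, float) / np.linalg.norm(u)
    v = np.asarray(v, float) / np.linalg.norm(v)
    return np.degrees(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def mcc_consistency(mcc_illuminants):
    """Angular error between each MCC's GT illuminant and the channel-wise
    median illuminant used as the scene's single-illuminant GT."""
    ills = np.asarray(mcc_illuminants, float)
    median_gt = np.median(ills, axis=0)
    errs = np.array([angle_deg(e, median_gt) for e in ills])
    return median_gt, errs, errs.min()
```

Applied per scene, `errs` yields the boxplots of Fig. 8(a) and `errs.min()` the histogram of Fig. 8(b).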

(a) Boxplot of the angular errors between our considered median GT and the GT illuminant of each MCC.
(b) Histogram of the minimum angular error between each MCC and the median GT.
Figure 8: Intra-scene MCC ground-truth variability for considered FUNT HDR images [25].

Appendix D Interpretation of NUS camera results

The NUS-8 camera dataset [16] comprises subsets of images representing largely similar scenes and illuminants, captured by different cameras. Because scene content and illumination are largely shared across subsets, differences in GT illuminant distributions can be associated with differences in Camera Spectral Sensitivity (CSS). In Fig. 9 we plot the per-camera median GT illuminant correction in RGB space for the 8 NUS subsets, and reproduce our NUS per-camera accuracy results from the main manuscript, where we compared our approach with the baseline.

Inspecting Fig. 9(a), it may be seen that the Fujifilm and Olympus cameras have distinct and relatively distant median GT corrections in RGB space compared with the other cameras in the NUS dataset. This correlates well with the results obtained with the baseline fine-tuning method, where naïve fine-tuning on these two cameras results in weak median angular-error performance. This is demonstrated in Fig. 9(b), where larger errors for these cameras are observed for the baseline approach (visualized in blue). In contrast, our Meta-AWB approach is able to adapt quickly to large differences in camera sensor and yields consistent performance across all cameras investigated, demonstrated by consistently lower error (visualized in green).

It should be noted that the Nikon D40 camera GT is not plotted in Fig. 9(a). Acquired later, this image set has substantially fewer images and no direct scene-to-scene correspondence, making a median-GT comparison with the other cameras less meaningful. The lack of scene correspondence may partially explain the weak naïve fine-tuning baseline results also observed for this camera.

(a) Median ground-truth RGB correction for NUS-8 cameras.
(b) Median angular error for each NUS camera, as shown in the main paper.
Figure 9: Results on NUS-9 cameras and median GT RGB corrections.

Appendix E Qualitative results

In Figure 10 we provide additional qualitative results, in the form of images from the Gehler-Shi dataset [28, 49]. For each image we show the input image produced by the camera and a white-balanced image corrected using the ground-truth illumination. We also show the output of our model (“Meta-AWB”) and that of the baseline fine-tuning approach reported in the paper. Color-checker boards are visible in the images; however, the relevant areas are masked prior to inference. Images are shown in sRGB space and clipped at a fixed percentile.

In similar fashion to [9], we sort the images in the dataset by the combined mean angular error of the evaluated methods. We present images in order of increasing average difficulty; the images to report were, however, selected by ordering from “hard” to “easy” and sampling with logarithmic spacing, so that images which proved challenging on average are over-represented.
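The selection procedure above can be sketched as follows: average the per-image angular errors across methods, sort indices from hard to easy, then pick log-spaced positions so the hard end is sampled more densely (the spacing details are our illustrative choice).

```python
import numpy as np

def select_for_display(per_method_errors, n_show=6):
    """Order images hard -> easy by mean angular error across methods,
    then sample indices with logarithmic spacing (denser at the hard end)."""
    mean_err = np.mean(per_method_errors, axis=0)  # shape: (n_images,)
    hard_to_easy = np.argsort(-mean_err)
    n = len(hard_to_easy)
    # log-spaced positions in [0, n-1]; duplicates collapse via unique
    pos = np.unique((np.geomspace(1, n, n_show) - 1).astype(int))
    return hard_to_easy[pos]
```

The first selected index is always the hardest image and the last the easiest, with intermediate picks skewed toward the hard end.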

(Figure 10 panels (a)–(x): for each of six scenes, the input image, the ground-truth solution, the Meta-AWB result and the baseline fine-tuning result, each estimate annotated with its angular error.)
Figure 10: For each scene we present the input image produced by the camera alongside the ground-truth white-balanced image. Images are shown in sRGB space and normalized to a fixed percentile. For both our Meta-AWB approach and the baseline algorithm we show the white-balanced image, as well as the angular error (in degrees) of the estimated illumination with respect to the ground truth.