Traditionally, Image Signal Processors (ISPs) are designed to optimize human appreciation of photographs. The core task is, given a raw measurement of an array of sensors (with different sensitivity to different light frequencies), to produce an image that looks natural to the human observer. To this end, there is a need to estimate the right colors and tone, and also compensate for acquisition process artifacts, such as noise.
However, in many domains, scenes are captured for machine consumption only. Examples include robot cameras, autonomous driving, and security cameras that are monitored automatically. In these domains, the objective is not to produce visually pleasing images but to achieve high performance in a given downstream task, e.g., object recognition. Thus, discarding the ISP and training the model directly on the raw data is tempting in these cases, as it saves the compute overhead of the ISP. Unfortunately, simply discarding the ISP has been shown in the literature to cause performance degradation [7, 5, 3]. This happens because the ISP serves as a ‘normalization’ of the data, transforming it to a canonical space that is independent of (or less dependent on) the camera used or the capturing environment. Solutions discussed in the literature usually involve an adjustment of the ISP for the vision task, either manually [18] or via learning [17]. Our method deals with the case where the ISP is discarded and mitigates the drop in performance.
Training a model on RAW images requires annotated data. Human labeling of RAW images is impractical, as they are sometimes almost unrecognizable to humans. The alternative is to label the RGB images and transfer the labels to RAW. The transfer can be done with an inverse model of the ISP: given a dataset of labeled RGB images, we can generate a matching labeled RAW dataset, relying on pixel alignment between RAW and RGB to get the labels for the RAW images [5, 7, 3].
In this work, instead of having an inverse model of the ISP for transferring the labels from RGB to RAW, we use a dataset of RAW-RGB pairs. Building such a dataset is relatively easy since it does not require human labeling. Then, by labeling the RGB images, we immediately get labels for the RAW images too. However, we wish to reduce the labeling cost, and thus, instead of manually labeling the RGB images, we use a pre-trained model to label them.
Using the above-mentioned dataset, we can train our model on RAW images with the transferred labels as ground truth. Yet, we still suffer from a big drop in performance compared to using the RGB inputs. To mitigate that, we suggest using Knowledge Distillation (KD) [10] to make the RAW predictions fit the RGB predictions. KD is a technique that is known to work quite well for compressing deep models, i.e., making a smaller model behave similarly to a larger model. Here we use KD to compress both the heuristically designed ISP and the deep classification model into a new deep classification model of the same size as the latter alone. We also show the advantage of our approach when training the network to produce similar predictions for short-exposure RAW images. In that case, in addition to the ISP and the classifier, the model also compresses the ‘knowledge’ of an ideal (non-existing) denoiser that maps the short-exposure RAW image to a longer-exposure RAW image.
2 Related Work
ISP for Vision. The ISP consists of a set of algorithms, usually applied sequentially, intended for transforming a RAW image into a visually appealing RGB image. The different steps either fix some degradation in the acquisition process (e.g., noise) or just transform the image to better fit the human vision (e.g., tone mapping). Recently, some works suggested replacing some or all of these operations with a learned model [15, 4]. But these designed or learned ISPs are optimized for the visual appearance and not the vision tasks.
Naively dropping the ISP does not work well. Several works [7, 5, 3] used simulated RAW images to train a classifier and observed a substantial gap in accuracy. Hansen et al. [7] report a larger gap for smaller models such as MobileNet, attributed to the failure of compact models to compensate for the lack of an ISP. Buckler et al. [3] identified the lack of demosaicing and gamma correction as a critical cause of performance degradation. They suggested modifying the imaging sensor such that demosaicing and gamma correction are no longer necessary, which limits the effect of the missing ISP.
Some methods suggested optimizing the ISP for downstream vision tasks. Yahiaoui et al. [18] suggested tuning the parameters of a traditional (not learned) ISP to improve object detection. Sharma et al. [16] add a component that takes an RGB image processed by the ISP and further enhances it for the downstream task. Diamond et al. [5] suggested jointly learning a low-level processing module, which performs denoising and deblurring, with the classifier. They train with simulated RAW images. Wu et al. [17] suggested VisionISP, a trainable ISP that is trained to optimize object detection in an autonomous driving setting. In this work, instead of designing or training an image processing module, i.e., a module whose input and output are images, we focus on the case where the vision model operates directly on RAW images.
Knowledge Distillation. Compressing larger models into smaller ones was first suggested by Bucilua et al. [2]. Application of this technique to deep neural networks, known as Knowledge Distillation, was suggested by Hinton et al. [10]. The key idea behind KD is that the soft labels (or soft probabilities) output by a classifier contain much more information about the data point than the hard label. KD has been widely used for numerous different applications, e.g., [1, 11, 12]. It has also been shown to work for distilling knowledge of non-neural-network machine-learning models [6]. In this work, we use KD to distill not just a deep model, but also the non-learned (manually engineered) algorithms in the ISP and the information gained in the physical process of acquiring a better signal (one that has a better SNR).
We are interested in training a classification model to operate on images from one modality when no semantic labels are available. What we have is a function that maps the images to another modality for which we have a pre-trained model. Alternatively, we might have a dataset of pairs of images from two different modalities with pixel alignment, even if a function that maps between these modalities does not exist (or we do not have access to it). In our case, the two modalities are a RAW image and a processed (RGB) image. We are also interested in operating on a short-exposure RAW image, where the reference processed image is based on a longer exposure or on multiple short-exposure RAW images (clearly, there is no deterministic mapping from the short-exposure RAW image to this reference).
In this work, instead of having an inverse model of the ISP for transferring the labels from RGB to RAW, we use a dataset of RAW-RGB pairs. Then, by labeling the RGB images, we immediately get labels for the RAW images too. However, we wish to reduce the labeling cost, and thus, instead of manually labeling the RGB images, we use a pretrained model to label them. We argue that training on the resulting dataset of RAW images paired with hard labels (e.g., an integer value or a one-hot encoding) using the cross-entropy (CE) loss is not the best option. Instead of using hard labels, we use the soft labels predicted by our pretrained model. Following many works that have shown its benefits, we use the KD loss.
The KD loss is computed between two probability vectors, the prediction of the RAW model and the prediction of the pretrained RGB model, both of which are the outputs of a softmax layer with temperature. This loss simultaneously distills the information from the heuristically designed ISP and the CNN model (classifier) pretrained on RGB images.
In the case of short-exposure RAW, we use the short-exposure RAW image as the input, and we also implicitly learn to classify low-dynamic-range and extremely noisy images. As commonly done, we use a linear combination of the CE loss and the KD loss. In our experiments, we found specific values of the combination weight and the temperature to work well.
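The combined objective described above can be sketched in a few lines. This is a minimal illustration, assuming the standard form of Hinton et al. (cross-entropy between temperature-softened teacher and student distributions, scaled by the squared temperature) and a convex combination with the CE term. The function names, the convex weighting, and the default values of `temperature` and `lam` are placeholders chosen for illustration, not the settings used in the experiments.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy H(target, predicted) between two probability vectors."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))

def distillation_loss(raw_logits, rgb_logits, temperature=4.0, lam=0.5):
    """Hypothetical combined loss for one sample.

    raw_logits: student logits (model fed with the RAW image).
    rgb_logits: teacher logits (pretrained model fed with the ISP output).
    temperature, lam: placeholder values, NOT the paper's settings.
    """
    # KD term: match the softened teacher distribution (assumed Hinton-style form).
    teacher_soft = softmax(rgb_logits, temperature)
    student_soft = softmax(raw_logits, temperature)
    kd = temperature ** 2 * cross_entropy(teacher_soft, student_soft)
    # CE term: the hard label is the teacher's top-1 prediction, so no human
    # annotation is involved.
    hard = rgb_logits.index(max(rgb_logits))
    one_hot = [1.0 if i == hard else 0.0 for i in range(len(rgb_logits))]
    ce = cross_entropy(one_hot, softmax(raw_logits))
    # Assumed convex combination of the two terms.
    return lam * ce + (1.0 - lam) * kd
```

For a fixed teacher output, the loss is smaller when the student logits reproduce the teacher logits than when they rank the classes differently, which is the behavior the distillation objective is meant to encourage.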
It is common practice to normalize the inputs so that each channel has zero mean and unit STD (calculated on the relevant training set). For the RAW images, we do the same, where the mean and STD are calculated separately for the R, G, and B pixels in the Bayer pattern.
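The per-color normalization can be illustrated as follows. This sketch assumes an RGGB Bayer layout and mosaics stored as nested lists of at least 2x2 pixels; the layout, the helper names, and the small `eps` guard are assumptions made for the example.

```python
import math

# Assumed RGGB layout: the color of a pixel is given by (row % 2, col % 2).
BAYER_RGGB = (("R", "G"), ("G", "B"))

def bayer_color(row, col, pattern=BAYER_RGGB):
    return pattern[row % 2][col % 2]

def bayer_statistics(mosaics, pattern=BAYER_RGGB):
    """Mean and STD per Bayer color over a list of 2-D mosaics (lists of rows).

    Every color must occur at least once, i.e. mosaics are at least 2x2.
    """
    sums = {"R": 0.0, "G": 0.0, "B": 0.0}
    sq_sums = {"R": 0.0, "G": 0.0, "B": 0.0}
    counts = {"R": 0, "G": 0, "B": 0}
    for mosaic in mosaics:
        for r, row in enumerate(mosaic):
            for c, value in enumerate(row):
                color = bayer_color(r, c, pattern)
                sums[color] += value
                sq_sums[color] += value * value
                counts[color] += 1
    stats = {}
    for color in "RGB":
        mean = sums[color] / counts[color]
        # Clamp tiny negative values caused by floating-point rounding.
        var = max(sq_sums[color] / counts[color] - mean * mean, 0.0)
        stats[color] = (mean, math.sqrt(var))
    return stats

def normalize_mosaic(mosaic, stats, pattern=BAYER_RGGB, eps=1e-8):
    """Normalize each pixel with the statistics of its own Bayer color."""
    out = []
    for r, row in enumerate(mosaic):
        normalized_row = []
        for c, value in enumerate(row):
            mean, std = stats[bayer_color(r, c, pattern)]
            normalized_row.append((value - mean) / (std + eps))
        out.append(normalized_row)
    return out
```

In practice the statistics would be computed once over the training set and then reused unchanged at test time, as is done for RGB inputs.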
In practice, for faster training, we initialize the RAW classifier with the weights of the pre-trained RGB model. Using this initialization, we can resort to short training (4 epochs in our experiments). Since RAW images have a single channel rather than 3 RGB channels, some adaptation is needed to use off-the-shelf classifiers with them (especially when initializing with pre-trained models). We do so in a very simple way: we transform the RAW images to RGB by filling in the missing values using bilinear interpolation.
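One simple way to realize this filling step is sketched below. It assumes an RGGB layout and a mosaic of at least 2x2 pixels, and it fills each missing color value with the average of the measured samples of that color in the surrounding 3x3 window. In the interior of the image this coincides with standard bilinear demosaicing; the border handling (averaging whatever same-color samples are available) is a choice made for the example.

```python
BAYER_RGGB = (("R", "G"), ("G", "B"))  # assumed layout
CHANNELS = ("R", "G", "B")

def bilinear_demosaic(mosaic, pattern=BAYER_RGGB):
    """Expand a 2-D Bayer mosaic to an H x W x 3 image (nested lists).

    The mosaic must be at least 2x2 so every window contains all colors.
    """
    height, width = len(mosaic), len(mosaic[0])
    image = [[[0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for k, channel in enumerate(CHANNELS):
                if pattern[r % 2][c % 2] == channel:
                    # This color was measured at this pixel: keep it.
                    image[r][c][k] = float(mosaic[r][c])
                    continue
                # Average the measured samples of this color in the
                # (border-clipped) 3x3 window around the pixel.
                neighbors = [
                    mosaic[rr][cc]
                    for rr in range(max(r - 1, 0), min(r + 2, height))
                    for cc in range(max(c - 1, 0), min(c + 2, width))
                    if pattern[rr % 2][cc % 2] == channel
                ]
                image[r][c][k] = sum(neighbors) / len(neighbors)
    return image
```

The output has the same spatial size as the mosaic and three channels, so it can be fed to a classifier whose first layer was pretrained on RGB inputs without changing the architecture.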
To validate our approach, we tested it in two cases. In the first, we test performance when operating on noisy and mosaiced images, i.e., discarding the denoising and demosaicing pre-processing. In the second, we test performance when the full ISP is discarded.
4.1 Discarding Denoising and Demosaicing
We first test our method on synthetically generated RAW images, since for these images we can compare to the classification performance of training with ground-truth labels. In this experiment, we limit the simulated ISP (the part we want to discard) to include only denoising and demosaicing. Thus, the RAW images are the RGB images subsampled according to a Bayer pattern with added Gaussian noise. We use the ImageNet dataset [13] for this experiment (ILSVRC2017).
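The synthetic degradation can be sketched as follows. The example assumes an RGGB layout, pixel values in [0, 1], noise added after subsampling, and no clipping; all of these, as well as the function name, are assumptions for illustration rather than details confirmed by the text.

```python
import random

BAYER_RGGB = (("R", "G"), ("G", "B"))  # assumed layout
CHANNEL_INDEX = {"R": 0, "G": 1, "B": 2}

def simulate_raw(rgb, noise_std, pattern=BAYER_RGGB, rng=None):
    """Mosaic an H x W x 3 image (nested lists) and add Gaussian noise.

    Each output pixel keeps only the color selected by the Bayer pattern,
    then zero-mean Gaussian noise with the given STD is added.
    """
    rng = rng or random.Random()
    raw = []
    for r, row in enumerate(rgb):
        raw_row = []
        for c, pixel in enumerate(row):
            channel = CHANNEL_INDEX[pattern[r % 2][c % 2]]
            raw_row.append(pixel[channel] + rng.gauss(0.0, noise_std))
        raw.append(raw_row)
    return raw
```

Passing a seeded `random.Random` instance makes the simulated validation set reproducible, which is convenient when comparing models trained with different losses on exactly the same distorted inputs.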
Table 1 presents results for the mosaiced and noisy ImageNet validation set at a fixed noise STD. Under such a distortion, the top-1 accuracy of ResNet18 [9] drops markedly from the 69.76% obtained on clean images. Training the model on distorted images (RAW) using either ground-truth labels or labels produced by a pretrained ResNet18 (based on the clean RGB images) improves performance to about 57%. Using the proposed ISP Distillation, we improve the performance by more than 4%. Similar trends are observed for MobileNetV2 [14] too. Fig. 2 shows the improvement is consistent across noise levels.
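Table 1: Top-1 and top-5 accuracy (%) on the mosaiced and noisy ImageNet validation set.
| Method | ResNet18 top-1 | ResNet18 top-5 | MobileNetV2 top-1 | MobileNetV2 top-5 |
|---|---|---|---|---|
| Clean images (upper bound) | 69.76 | 89.08 | 71.88 | 90.29 |
| Trained w/ GT labels | 57.21 | 80.76 | 56.31 | 80.06 |
| Trained w/ predicted labels | 56.59 | 79.70 | 56.73 | 80.15 |
| ISP Distillation (ours) | 61.35 | 83.70 | 61.43 | 83.95 |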
4.2 Discarding the Full ISP
To test the effect of discarding the full ISP, we use real data based on images from the HDR+ dataset [8]. This dataset is organized into bursts. Each burst has between 2 and 10 short-exposure raw photos, each generally 12-13 Mpixels, depending on the type of camera used for capturing. The images in a burst are generally captured with the same exposure time and gain. For each burst, the dataset also provides a merged RAW image that is generated by aligning the short-exposure RAW images and combining them to produce a single high-dynamic-range RAW image. This merged RAW image is then processed by their ISP to produce the final RGB image. See [8] for more details.
We performed two kinds of experiments. In the first, we want to distill the ISP and a pre-trained classification model. Thus, we use the merged RAW as input to our model, trying to mimic the predictions on the final RGB. In the second experiment, we choose a single short-exposure RAW and train our model to mimic the predictions on the final RGB, produced from the merged RAW. We always choose the short-exposure RAW that is pixel-aligned with the merged RAW (information provided in the dataset).
In these experiments there is no ground truth for evaluation, so the top-1 and top-5 accuracies are measured as the agreement of our model with the predictions of the pre-trained classification model on the final RGB image. The bursts are randomly split into a training set and a test set. The images in the original HDR+ dataset are of very high resolution, whereas popular classifier architectures expect much smaller inputs. While we could down-sample the images, doing so would have removed the effect of the Bayer pattern (and potentially the effect of the noise too), and we are interested in understanding the ability of the model to overcome these artifacts. Therefore, we chose to split the images into smaller non-overlapping patches. For training, we use all the patches originating from the bursts in the training set. For testing, since many of the patches do not contain an object, we only use those for which the pre-trained RGB classifier predicted an object with probability above a fixed threshold.
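The patch extraction and the agreement-based metric can be sketched as follows. The patch size and the confidence threshold are hypothetical parameters (their actual values are not given here), and the top-k rule used below, counting a hit when the teacher's top-1 class appears among the student's top-k classes, is one plausible reading of the evaluation protocol rather than a confirmed definition.

```python
def split_into_patches(image, patch_size):
    """Non-overlapping patch_size x patch_size crops of a 2-D image.

    Rows and columns that do not fill a whole patch are dropped. An even
    patch_size keeps the Bayer phase identical in every patch.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patches.append([row[left:left + patch_size]
                            for row in image[top:top + patch_size]])
    return patches

def top_k(probs, k):
    """Indices of the k largest probabilities, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def agreement(raw_probs, rgb_probs, k=1, min_confidence=0.5):
    """Fraction of confident patches on which the RAW model agrees with
    the RGB model. min_confidence is a placeholder threshold."""
    hits, total = 0, 0
    for student, teacher in zip(raw_probs, rgb_probs):
        if max(teacher) <= min_confidence:
            continue  # teacher is not confident: patch likely has no object
        total += 1
        teacher_label = top_k(teacher, 1)[0]
        hits += teacher_label in top_k(student, k)
    return hits / total if total else float("nan")
```

Filtering on the teacher's confidence mirrors the test-set selection described above, so patches without a recognizable object do not dilute the agreement score.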
Table 2 compares the performance of training the RAW model with predicted labels vs. applying ISP Distillation. Our approach shows a substantial improvement in top-1 accuracy of about 9.1% for ResNet18 and about 6.2% for MobileNetV2. It also exhibits a similar advantage for the model trained on short-exposure RAW, where the improvement is about 8.1% and 11.0% for ResNet18 and MobileNetV2, respectively. A noticeable advantage exists in all experiments for top-5 accuracy too, bringing the accuracy to roughly 97% to 98%. This suggests that the probability distribution predicted by the RAW model closely matches the one from the RGB model.
For the short-exposure RAW experiment, we also compared to a sequential combination of a network that performs the ISP part (DeepISP [15]) followed by the classifier, where both are trained end-to-end to optimize classification performance. The DeepISP stage adds its own FLOPs on top of those of ResNet18 or MobileNetV2. Note that the classifier alone, trained with ISP Distillation, performs almost as well as the combination of models, which is more computationally demanding.
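Table 2: Top-1 and top-5 agreement (%) with the predictions of the pre-trained RGB classifier on the HDR+ data.
| Method | ResNet18 top-1 | ResNet18 top-5 | MobileNetV2 top-1 | MobileNetV2 top-5 |
|---|---|---|---|---|
| Classifying RAW images: | | | | |
| Training with labels | 84.73 | 95.40 | 87.16 | 97.37 |
| ISP Distillation (ours) | 93.83 | 98.12 | 93.32 | 98.08 |
| Classifying short-exposure RAW images: | | | | |
| Training with labels | 83.73 | 95.31 | 80.62 | 94.48 |
| ISP Distillation (ours) | 91.87 | 96.98 | 91.57 | 96.99 |
| With ISP (DeepISP [15]) | 91.86 | 97.67 | 92.30 | 97.71 |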
4.3 Ablation studies
Since we initialize our RAW model with ImageNet pre-trained weights, we want to test how many of the layers need to be adapted to accommodate the RAW input. Is it just about local artifacts, so that finetuning only the first layers is enough, or are there global distortions that require higher-level features from deeper layers to adapt too? Table 3 shows that most of the gains come from finetuning the first layers, and training only the first half of ResNet18 is almost as good as training the full model.
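Table 3: Top-1 accuracy (%) of ResNet18 when finetuning only part of the model.
| Trained part of the model | Top-1 |
|---|---|
| Training the first quarter of the model | 90.61 |
| Training the first half of the model | 91.66 |
| Training Full model | 91.87 |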
We have shown that it is possible to distill the knowledge of not just a pre-trained model but also of a heuristically designed ISP, to improve the performance of a classifier for RAW images. We have also shown an improvement for short-exposure RAW, distilling the information of the physical process of acquiring a better signal. ISP Distillation is a step towards reaching classification performance on RAW images similar to that on RGB. It can advance the deployment of vision models on RAW images in domains where the images are consumed by machines and not humans, saving the compute cost of the ISP.
References
- [1] (2018) Label refinery: improving imagenet classification through label progression. arXiv preprint arXiv:1805.02641.
- [2] (2006) Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541.
- [3] (2017) Reconfiguring the imaging pipeline for computer vision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 975–984.
- [4] (2018) Learning to see in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3291–3300.
- [5] (2017) Dirty pixels: optimizing image classification architectures for raw sensor data. arXiv preprint arXiv:1701.06487.
- [6] (2019) Distilling knowledge for non-neural networks. In 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1411–1416.
- [7] (2019) ISP4ML: understanding the role of image signal processing in efficient deep learning vision systems. arXiv preprint arXiv:1911.07954.
- [8] (2016) Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 35 (6).
- [9] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- [10] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- [11] (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947.
- [12] (2019) Few-shot image recognition with knowledge transfer. In Proceedings of the IEEE International Conference on Computer Vision, pp. 441–449.
- [13] (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252.
- [14] (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520.
- [15] (2018) DeepISP: toward learning an end-to-end image processing pipeline. IEEE Transactions on Image Processing 28 (2), pp. 912–923.
- [16] (2018) Classification-driven dynamic image enhancement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4033–4041.
- [17] (2019) Visionisp: repurposing the image signal processor for computer vision applications. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 4624–4628.
- [18] (2019) Overview and empirical analysis of isp parameter tuning for visual perception in autonomous driving. Journal of Imaging 5 (10), pp. 78.