Bridging the Gap Between Computational Photography and Visual Recognition

01/28/2019 ∙ by Rosaura G. VidalMata, et al. ∙ University of Notre Dame

What is the current state of the art for image restoration and enhancement applied to degraded images acquired under less than ideal circumstances? Can the application of such algorithms as a pre-processing step improve image interpretability for manual analysis or for automatic visual recognition of scene content? While there have been important advances in the area of computational photography to restore or enhance the visual quality of an image, the capabilities of such techniques have not always translated in a useful way to visual recognition tasks. Consequently, there is a pressing need for the development of algorithms that are designed for the joint problem of improving visual appearance and recognition, which will be an enabling factor for the deployment of visual recognition tools in many real-world scenarios. To address this, we introduce the UG^2 dataset as a large-scale benchmark composed of video imagery captured under challenging conditions, and two enhancement tasks designed to test algorithmic impact on visual quality and automatic object recognition. Furthermore, we propose a set of metrics to evaluate the joint improvement of such tasks as well as individual algorithmic advances, including a novel psychophysics-based evaluation regime for human assessment and a realistic set of quantitative measures for object recognition performance. We introduce six new algorithms for image restoration or enhancement, which were created as part of the IARPA-sponsored UG^2 Challenge workshop held at CVPR 2018. Under the proposed evaluation regime, we present an in-depth analysis of these algorithms and a host of deep learning-based and classic baseline approaches. From the observed results, it is evident that we are in the early days of building a bridge between computational photography and visual recognition, leaving many opportunities for innovation in this area.


1 Introduction

The advantages of collecting imagery from autonomous vehicle platforms such as small UAVs are clear. Man-portable systems can be launched from safe positions to penetrate difficult or dangerous terrain, acquiring hours of video without putting human lives at risk during search and rescue operations, disaster recovery, and other scenarios where some measure of danger has traditionally been a stumbling block. Similarly, cars equipped with vision systems promise to improve road safety by more reliably reacting to hazards and other road users compared to humans. However, what remains unclear is how to automate the interpretation of what are inherently degraded images collected in such applications, a necessary measure in the face of millions of frames from individual flights or road trips. A human-in-the-loop cannot manually sift through data of this scale for actionable information in real time. Ideally, a computer vision system would be able to identify objects and events of interest or importance, surfacing valuable data out of a massive pool of largely uninteresting or irrelevant images, even when that data has been collected under less than ideal circumstances. To build such a system, one could turn to recent machine learning breakthroughs in visual recognition, which have been enabled by access to millions of training images from the Internet [1, 2]. However, such approaches cannot be used as off-the-shelf components to assemble the system we desire, because they do not take into account artifacts unique to the operation of the sensor and optics configuration on an acquisition platform, nor are they strongly invariant to changes in weather, season, and time of day.

Whereas deep learning-based recognition algorithms can perform on par with humans on good quality images [3, 4], their performance on distorted samples is degraded. It has been observed that the presence of imaging artifacts can severely impact the recognition accuracy of state-of-the-art approaches [5, 6, 7, 8, 9, 10, 11]. Having a real-world application such as a search and rescue drone or autonomous driving system fail in the presence of ambient perturbations such as rain, haze or even motion-induced blur could have unfortunate aftereffects. Consequently, developing and evaluating algorithms that can improve the object classification of images captured under less than ideal circumstances is fundamental for the implementation of visual recognition models that need to be reliable. And while one’s first inclination would be to turn to the area of computational photography for algorithms that remove corruptions or gain resolution, one must ensure that they are compatible with the recognition process itself, and do not adversely affect the feature extraction or classification processes (Fig. 1) before incorporating them into a processing pipeline that corrects and subsequently classifies images.

The computer vision renaissance we are experiencing has yielded effective algorithms that can improve the visual appearance of an image [12, 13, 14, 15, 16], but many of their enhancing capabilities do not translate well to recognition tasks as the training regime is often isolated from the visual recognition aspect of the pipeline. In fact, recent works [5, 17, 18, 19] have shown that approaches that obtain higher scores on classic quality estimation metrics (namely Peak Signal to Noise Ratio), and thus would be expected to produce high quality images, do not necessarily perform well at improving or even maintaining the original image classification performance. Taking this into consideration, we propose to bridge the gap between traditional image enhancement approaches and visual recognition tasks as a way to jointly increase the abilities of enhancement techniques for both scenarios.

In line with the above objective, in this work we introduce UG^2: a large-scale video benchmark for assessing image restoration and enhancement for visual recognition. It consists of a publicly available dataset (http://www.ug2challenge.org) composed of videos captured from three difficult real-world scenarios: uncontrolled videos taken by UAVs and manned gliders, as well as controlled videos taken on the ground. Over annotated frames for hundreds of ImageNet classes are available. From the base dataset, different enhancement tasks can be designed to evaluate improvement in visual quality and automatic object recognition, including supporting rules that can be followed to execute such evaluations in a precise and reproducible manner. This article describes the creation of the UG^2 dataset as well as the advances in visual enhancement and recognition that have been possible as a result.

Fig. 1: (Top) In principle, enhancement techniques like the Super-Resolution Convolutional Neural Network (SRCNN) [20] should improve visual recognition performance by creating higher quality inputs for recognition models. (Bottom) In practice, this is not always the case, especially when new artifacts are unintentionally introduced, such as in this application of Deep Video Deblurring [16].

Specifically, we summarize the results of the IARPA-sponsored UG^2 Challenge workshop held at CVPR 2018. The challenge consisted of two specific tasks defined around the UG^2 dataset: (1) image restoration and enhancement to improve image quality for manual inspection, and (2) image restoration and enhancement to improve the automatic classification of objects found within individual images. The UG^2 dataset contains manually annotated video imagery (including object labels and bounding boxes) with an ample variety of imaging artifacts and optical aberrations (see Fig. 4 in Sec. 4 below); thus it allows for the development and quantitative evaluation of image enhancement algorithms. Participants in the challenge were able to use the provided imagery and as much out-of-dataset imagery as they liked for training and validation purposes. Enhancement algorithms were then submitted for evaluation and results were revealed at the end of the competition period.

The competition resulted in six new algorithms, designed by different teams, for image restoration and enhancement in challenging image acquisition circumstances. These algorithms included strategies to dynamically estimate corruption and choose the appropriate response, the simultaneous targeting of multiple artifacts, the ability to leverage known image priors that match a candidate probe image, super-resolution techniques adapted from the area of remote sensing, and super-resolution via Generative Adversarial Networks. This was the largest concerted effort to date to develop new approaches in computational photography supporting human preference and automatic recognition. We look at all of these algorithms in this article.

Having a good stable of existing and new restoration and enhancement algorithms is a nice start, but are any of them useful for the image analysis tasks at hand? Here we take a deeper look at the problem of scoring such algorithms. Specifically, we explore whether researchers have been using the right automated evaluation metrics for tasks like deconvolution, super-resolution and other forms of image artifact removal. We suggest a visual psychophysics-inspired assessment regime, where human perception is the reference point, as an alternative to other forms of automatic and manual assessment that have been proposed in the literature. Using the methods and procedures of psychophysics that have been developed for the study of human vision in psychology, we can perform a more principled assessment of image improvement than just a simple A/B test, which is common in computer vision. We compare this human experiment with the recently introduced Learned Perceptual Image Patch Similarity (LPIPS) metric proposed by Zhang et al. [21]. Further, when it comes to assessing the impact of restoration and enhancement algorithms on visual recognition, we suggest that the recognition performance numbers are the only metric that one should consider. As we will see from the results, much more work is needed before practical applications can be supported.

In summary, the contributions of this article are:

  • A new video benchmark dataset representing both ideal conditions and common aerial image artifacts, which we make available to facilitate new research and to simplify the reproducibility of experimentation.

  • A set of protocols for the study of image enhancement and restoration for image quality improvement, as well as visual recognition. This includes a novel psychophysics-based evaluation regime for human assessment and a realistic set of quantitative measures for object recognition performance.

  • An extensive evaluation of the influence of image aberrations and other problematic conditions on common object recognition models including VGG16 and VGG19 [22], InceptionV3 [23], and ResNet50 [24].

  • The introduction of six new algorithms for image enhancement or restoration, which were created as part of the UG^2 Challenge workshop held at CVPR 2018. These algorithms are pitted against eight different classical and deep learning-based baseline algorithms from the literature on the same benchmark data.

  • A series of recommendations on specific aspects of the problem that the field should focus its attention on so that we have a better chance at enabling scene understanding under less than ideal image acquisition circumstances.

2 Related Work

Datasets. The areas of image restoration and enhancement have a long history in computational photography, with associated benchmark datasets that are mainly used for the qualitative evaluation of image appearance. These include very small test image sets such as Set5 [13] and Set14 [12], the set of blurred images introduced by Levin et al. [25], and the DIVerse 2K resolution image dataset (DIV2K) [26] designed for super-resolution benchmarking. Datasets containing more diverse scene content have been proposed including Urban100 [15] for enhancement comparisons and LIVE1 [27] for image quality assessment. While not originally designed for computational photography, the Berkeley Segmentation Dataset has been used by itself [15] and in combination with LIVE1 [28] for enhancement work. The popularity of deep learning methods has increased demand for training and testing data, which Su et al. provide as video content for deblurring work [16]. Importantly, none of these datasets were designed to combine image restoration and enhancement with recognition for a unified benchmark.

Most similar to the dataset we employ in this paper are various large-scale video surveillance datasets, especially those which provide a “fixed” overhead view of urban scenes [29, 30, 31, 32]. However, these datasets are primarily meant for other research areas (e.g., event/action understanding, video summarization, face recognition) and are ill-suited for object recognition tasks, even if they share some common imaging artifacts that impair recognition as a whole.

With respect to data collected by aerial vehicles, the VIRAT Video Dataset [33] contains “realistic, natural and challenging (in terms of its resolution, background clutter, diversity in scenes)” imagery for event recognition, while the VisDrone2018 Dataset [34] is designed for object detection and tracking. Other datasets including aerial imagery are the UCF Aerial Action Data Set [35], UCF-ARG [36], UAV123 [37], and the multi-purpose dataset introduced by Yao et al. [38]. As with the computational photography datasets, none of these sets have protocols for image restoration and enhancement coupled with object recognition.

Visual Quality Enhancement. There is a wide variety of enhancement methods dealing with different kinds of artifacts, such as deblurring (where the objective is to recover a sharp version of a blurry image without knowledge of the blur parameters) [39, 40, 25, 41, 42, 43, 44, 45], denoising (where the goal is the restoration of an image $x$ from a corrupted observation $y = x + n$, where $n$ is assumed to be noise with variance $\sigma^2$) [41, 42, 46, 47], compression artifact reduction (which focuses on removing blocking artifacts, ringing effects or other lossy compression-induced degradation) [48, 49, 50, 51], reflection removal [52, 53], and super-resolution (which attempts to estimate a high-resolution image from one or more low-resolution images) [54, 55, 56, 57, 58, 59, 20, 60, 61]. Other approaches designed to deal with atmospheric perturbations include dehazing (which attempts to recover the scene radiance $J$, the global atmospheric light $A$, and the medium transmission $t$ from a hazy image $I$) [62, 63, 64, 65, 66], and rain removal techniques [67, 68, 69, 70].

Most of these approaches are tailored to address a particular kind of visual aberration, and the presence of multiple problematic conditions in a single image might lead to the introduction of artifacts by the chosen enhancement technique. Recent work has explored the possibility of handling multiple degradation types [71, 72, 73, 74].

Visual Enhancement for Recognition. Intuitively, if an image has been corrupted, then employing restoration techniques should improve the recognition of objects in the image. An early attempt at unifying a high-level task like object recognition with a low-level task like deblurring was performed by Zeiler et al. through deconvolutional networks [75, 76]. Similarly, Haris et al. [18] proposed an end-to-end super-resolution training procedure that incorporated detection loss as a training objective, obtaining superior object detection results compared to traditional super-resolution methods for a variety of conditions (including additional perturbations on the low-resolution images such as the addition of Gaussian noise).

Sajjadi et al. [19] argue that the use of traditional metrics such as Peak Signal to Noise Ratio (PSNR), Structural Similarity Index (SSIM), or the Information Fidelity Criterion (IFC) might not reflect the performance of some models, and propose the use of object recognition performance as an evaluation metric. They observed that methods that produced images of higher perceptual quality obtained higher classification performance despite obtaining low PSNR scores. In agreement with this, Gondal et al. [77] observed the correlation of the perceptual quality of an image with its performance when processed by object recognition models. Similarly, Tahboub et al. [78] evaluate the impact of degradation caused by video compression on pedestrian detection. Other approaches have used visual recognition as a way to evaluate the performance of visual enhancement algorithms for tasks such as text deblurring [79, 80], image colorization [81], and single image super-resolution [82].
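For reference, PSNR is computed from the mean squared error between a reference image and its processed counterpart, so it rewards pixel-wise fidelity and carries no information about whether the features a classifier relies on survive. A minimal sketch, assuming 8-bit images represented as flat lists of intensities (the function name and the list representation are illustrative choices, not taken from the works cited above):

```python
import math

def psnr(reference, processed, max_value=255.0):
    """Peak Signal to Noise Ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel intensities (an
    illustrative representation chosen to keep the sketch dependency-free).
    """
    if len(reference) != len(processed) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - p) ** 2 for r, p in zip(reference, processed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Two enhanced images with the same PSNR can differ in which details they preserve, which is one reason recognition performance is proposed as a complementary measure.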

While the above approaches employ object recognition in addition to visual enhancement, there are approaches designed to overlook the visual appearance of the image and instead make use of enhancement techniques to exclusively improve the object recognition performance. Sharma et al. [83] make use of dynamic enhancement filters in an end-to-end processing and classification pipeline that incorporates two loss functions (enhancement and classification). The approach focuses on improving classification performance on challenging high-quality images. In contrast to this, Yim et al. [10] propose a classification architecture (composed of a pre-processing module and a neural network model) to handle images degraded by noise. Li et al. [84] introduced a dehazing method that is concatenated with Faster R-CNN and jointly optimized as a unified pipeline. It outperforms traditional Faster R-CNN and other non-joint approaches.

Additional work has been undertaken in using visual enhancement techniques to improve high-level tasks such as face recognition [85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97] (through the incorporation of deblurring, super-resolution, and hallucination techniques) and person re-identification [98] algorithms for video surveillance data.

3 A New Evaluation Regime for Image Restoration and Enhancement

Our goal in this work is to provide insights on the impact of image enhancement algorithms in visual recognition tasks and find out which of them, in conjunction with the strongest features and supervised machine learning approaches, are promising candidates for different problem domains. To support this, we designed two evaluation tasks: (1) enhancement to facilitate manual inspection, where algorithms produce enhanced images to facilitate human assessment, and (2) enhancement to improve object recognition, where algorithms produce enhanced images to improve object classification by state-of-the-art neural networks.

3.1 Enhancement to Facilitate Manual Inspection

The first task is an evaluation of the qualitative enhancement of images. Through this task, we wish to answer two questions: Did the algorithm produce an enhancement that agrees with human perceptual judgment? And to what extent was the enhancement an improvement or deterioration? Widely used metrics such as SSIM [99] have attempted to answer these questions by estimating human perception, but have often failed to accurately imitate the nuances of human visual perception, and at times have caused algorithms to deteriorate perceptual quality [21].

While prior work has successfully adapted psychophysical methods from psychology as a means to directly study perceptual quality, these methods have been primarily posed as real-vs-fake tests [100]. With respect to qualitative enhancement, these real-vs-fake methods can only indicate if an enhancement has caused enough alteration to cause humans to cross the threshold of perception, and provide little help in answering the two questions we are interested in for this task. Zhang et al. [21] came close to answering these questions when they proposed LPIPS for evaluating perceptual similarity. However, this metric lacks the ability to measure whether the enhancement was an improvement or a deterioration (see analysis in Sec. 6).

Fig. 2: The visual enhancement task deployed on Amazon Mechanical Turk. An observer is presented with the original image and the enhanced image. The observer is then asked to select which label they perceive is most applicable. The selected label is converted to an integer value. The final rating for the enhanced image is the mean score from approximately observers. See Sec. 3.1 for further details.

In light of this, for our task we propose a new procedure, grounded in psychophysics, to evaluate the visual enhancement of images by answering both of our posed questions. The procedure for assessing image quality enhancement is a non-forced-choice procedure that allows us to take measurements of both the threshold of perceived change and the suprathresholds, which represent the degree of perceived change [101]. Specifically, we employ a bipolar labeled Likert Scale to estimate the amount of improvement or deterioration the observer perceives, once the threshold has been crossed. The complete procedure is as follows.

An observer is presented with an image positioned on the left-hand side of a screen and the output of the enhancement algorithm on the right. The observer is informed that the image on the left is the original and the image on the right is the enhanced version. Below the image pair, five labels are provided and the observer is asked to select the label that most applies (see Fig. 2 for labels and layout). To capture as much of the underlying complexities in human judgment as possible, no criteria are provided for making a selection. To reduce any dependence on the subsystems in the visual cortex that specialize in memory, the pair of images is displayed until the observer selects their perceived label. An observer is given unlimited time, and informed that providing accurate labels is most important. For images larger than pixels, the observer has the option to enlarge the images and examine them in finer detail.

The label that is selected by the observer is then converted to an assigned ordinal value, where values above and below the midpoint of the scale are the suprathreshold measurements of improvement and deterioration, respectively. The midpoint rating is the superficial threshold imposed on the observer, which indicates that the enhancement was imperceptible. In our proposed procedure, there is no notion of accuracy in the measurement of qualitative enhancement, as it is left entirely up to the observer’s subjective discretion. However, when there are sampled observers, the perception of quality to the average observer can be estimated to provide a reliable metric for the evaluation of qualitative enhancement. We verified this holds true even when the original and enhanced images are swapped (i.e., responses are symmetric).
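The conversion from selected labels to a mean rating can be sketched as follows. The label strings and the symmetric integer scale centered on zero are assumptions made for illustration (the actual labels appear in Fig. 2); only the overall structure, a five-point bipolar scale averaged over observers, comes from the procedure above.

```python
from statistics import mean

# Hypothetical five-point bipolar scale; the label wording and the integer
# values are assumptions, chosen so that 0 marks "no perceived change".
LABEL_TO_VALUE = {
    "much worse": -2,
    "slightly worse": -1,
    "no perceptible change": 0,
    "slightly better": 1,
    "much better": 2,
}

def mean_rating(labels):
    """Average the ordinal values of the labels chosen by all observers."""
    if not labels:
        raise ValueError("at least one observer rating is required")
    return mean(LABEL_TO_VALUE[label] for label in labels)

def describe(rating):
    """Report the direction of the perceived change for a mean rating."""
    if rating > 0:
        return "improvement"
    if rating < 0:
        return "deterioration"
    return "imperceptible"
```

With this encoding, the sign of the mean gives the direction of the perceived change and its magnitude gives the degree, which is the information a real-vs-fake test cannot provide.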

To perform a large-scale evaluation, we used Amazon’s Mechanical Turk (AMT) service, which is widely deployed for many related tasks in computer vision [21, 102, 1]. AMT allows a Requester (i.e., researcher) to hire Workers to perform a task for payment. Our task was for a Worker to participate in the rating procedure for 100 image pairs. An additional three sentinel image pairs were given to ensure that the Worker was actively participating in the rating procedure. Workers who failed to correctly respond to at least two of the three sentinel image pairs were not paid and their ratings were discarded. In total, we had over Workers rating each image enhancement approximately times. Out of that pool, successfully classified a majority of the sentinel image pairs (the ratings provided by the remaining workers were discarded). See Sec. 6 for results and analysis.
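The sentinel check described above amounts to a simple filter over Workers. The sketch below assumes each Worker's responses are stored as a dictionary with the hypothetical keys `sentinel_correct` and `ratings`; the rule of keeping only Workers with at least two of three correct sentinel responses is taken from the text.

```python
MIN_CORRECT_SENTINELS = 2  # at least two of the three sentinel pairs

def keep_valid_workers(workers, min_correct=MIN_CORRECT_SENTINELS):
    """Return only the Workers who passed the sentinel check.

    Each worker is a dict with the hypothetical keys
    'sentinel_correct' (list of booleans, one per sentinel pair) and
    'ratings' (mapping from image-pair id to ordinal rating).
    """
    return [w for w in workers if sum(w["sentinel_correct"]) >= min_correct]

def ratings_per_image(workers):
    """Group the ratings of the given Workers by image-pair id."""
    grouped = {}
    for worker in workers:
        for image_id, rating in worker["ratings"].items():
            grouped.setdefault(image_id, []).append(rating)
    return grouped
```

The grouped ratings can then be averaged per image pair to obtain the final score for each enhancement.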

3.2 Enhancement to Improve Object Recognition

Fig. 3: Work flow for classification improvement: after enhancing the input frame with the candidate algorithm, the annotated objects in the frame are extracted and sent as input to each of the classification networks, which classify the image. Each algorithm’s score is then calculated as specified in Sec. 3.2.2.

The second task is an evaluation of the performance improvement enhanced images lead to when used as input to state-of-the-art image classification networks. When considering a fixed dataset, the evaluation protocol allows for the use of some within-dataset training data (the data provided by the UG^2 dataset, described below, for training contains frame-level annotations of the object classes of interest), and as much out-of-dataset data as needed for training and validation purposes. In order to establish good baselines for classification performance before and after the application of image enhancement and restoration algorithms, this task makes use of a selection of deep learning approaches to recognize annotated objects and then scores results based on the classification accuracy. The Keras [103] versions of the pre-trained networks VGG16 and VGG19 [22], InceptionV3 [23], and ResNet50 [24] are used for this purpose.

Each candidate algorithm is treated as an image pre-processing step to prepare sequestered test images to be submitted to all four networks. After pre-processing, the objects of interest are cropped out of the images based on verified ground-truth coordinates. The cropped images are then used as input to the networks. Algorithms are evaluated based on any improvement observed over the baseline classification result (i.e., the classification scores of the un-altered test images). The work flow of this evaluation pipeline is shown in Fig. 3. To avoid introducing further artifacts due to down-sampling, we require each algorithm to produce an output frame of the same size as its input.
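The evaluation flow can be sketched as below, with the enhancement algorithm and the classification networks passed in as plain callables. Everything here is illustrative: frames are nested lists of pixels, boxes are hypothetical `(x, y, width, height)` tuples, and the function names do not come from the challenge code. In practice the classifiers would wrap the pre-trained Keras networks listed above.

```python
def crop(frame, box):
    """Cut the region box = (x, y, width, height) out of a frame.

    The frame is a list of rows, each row a list of pixels.
    """
    x, y, w, h = box
    return [row[x:x + w] for row in frame[y:y + h]]

def classify_enhanced_objects(frame, boxes, enhance, classifiers):
    """Enhance a frame, crop its annotated objects, and classify each crop.

    enhance:      callable mapping a frame to a frame of the same size
    classifiers:  dict of name -> callable mapping a crop to a ranked
                  list of predicted synsets
    Returns a dict of name -> list of predictions, one entry per box.
    """
    enhanced = enhance(frame)
    same_height = len(enhanced) == len(frame)
    same_width = all(len(a) == len(b) for a, b in zip(enhanced, frame))
    if not (same_height and same_width):
        raise ValueError("enhanced frame must match the input frame size")
    crops = [crop(enhanced, box) for box in boxes]
    return {name: [net(c) for c in crops] for name, net in classifiers.items()}
```

Running the same function with an identity `enhance` yields the baseline predictions on the un-altered images, against which each candidate algorithm is compared.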

3.2.1 Classification Metrics

The networks used for the classification task return a list of the ImageNet synsets (ImageNet provides images for “synsets” or “synonym sets” of words or phrases that describe a concept in WordNet [104]) along with the probability of the object belonging to each of the synset classes. However (as will be discussed in Sec. 4), in many cases it is impossible to provide an absolute labeling for the annotated objects. Consequently, most of the super-classes that can be considered are composed of more than one ImageNet synset. That is, each annotated image has a single super-class label, which in turn is defined by a set of ImageNet synsets.

To measure accuracy, we observe the number of correctly identified synsets in the top-five predictions made by each pre-trained network. A prediction is considered to be correct if its synset belongs to the set of synsets in the ground-truth super-class label. We use two metrics for this. The first measures the rate of achieving at least one correctly classified synset class (M1). In other words, for a given super-class label, a network is able to place one or more correctly classified synsets in the top-five predictions. The second measures the rate of placing all of the correct synset classes of the super-class label's synset set in the top-five predictions (M2). For example, for a super-class label composed of three synsets, a network is able to place all three correct synsets in the top-five predictions.
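Both metrics reduce to set operations between a network's top-five predictions and the synsets of the ground-truth super-class. The sketch below is one reading of the definitions above: M2 is taken to require that every synset of the super-class appears in the top five, which is an interpretation rather than a confirmed detail. Function names are illustrative.

```python
def top5_hits(predictions, superclass_synsets):
    """Synsets of the super-class found among the five highest-ranked predictions."""
    return set(predictions[:5]) & set(superclass_synsets)

def m1_correct(predictions, superclass_synsets):
    """M1: at least one super-class synset is in the top five."""
    return len(top5_hits(predictions, superclass_synsets)) >= 1

def m2_correct(predictions, superclass_synsets):
    """M2 (as interpreted here): all super-class synsets are in the top five."""
    return top5_hits(predictions, superclass_synsets) == set(superclass_synsets)

def rate(samples, is_correct):
    """Fraction of (predictions, superclass_synsets) samples judged correct."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(is_correct(p, s) for p, s in samples) / len(samples)
```

Under this reading M2 is never easier to satisfy than M1 for a non-empty super-class, so M2 rates are bounded above by M1 rates on the same samples.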

3.2.2 Scoring

Each image enhancement or restoration algorithm’s performance on the classification task is then calculated by applying one of the two metrics defined above for each of the four networks and each collection within the UG^2 dataset. This results in scores for each metric (i.e., M1 or M2 scores from VGG16, VGG19, Inception, and ResNet for the UAV, Glider, and Ground collections). For the image enhancement and restoration algorithms we consider in this article, each is ranked against all other algorithms based on these scores. A score for an enhancement algorithm is considered “valid” only if it is higher than the score obtained by evaluating the classification performance of the un-altered images, that is, only if it improves upon the baseline of classifying the original images. The number of valid scores in which an algorithm excels over the others being evaluated is then counted as its score in points for the task (for a maximum of points, achievable if an algorithm obtained the highest improvement, compared to all other competitors, in all possible configurations).
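The point-counting rule can be sketched as follows. Scores are assumed to be stored per configuration, where a configuration is a hypothetical `(metric, network, collection)` tuple; ties for the best score are resolved here by awarding the point to every tied algorithm, a choice the text does not specify.

```python
def count_points(algorithm_scores, baseline_scores):
    """Award one point per configuration to the best valid algorithm.

    algorithm_scores: dict of algorithm name -> {configuration: score}
    baseline_scores:  {configuration: score of the un-altered images}
    A score is valid only if it is strictly higher than the baseline.
    """
    points = {name: 0 for name in algorithm_scores}
    for config, baseline in baseline_scores.items():
        valid = {
            name: scores[config]
            for name, scores in algorithm_scores.items()
            if config in scores and scores[config] > baseline
        }
        if not valid:
            continue  # nobody improved on the baseline: no point awarded
        best = max(valid.values())
        for name, score in valid.items():
            if score == best:  # assumption: tied winners each get a point
                points[name] += 1
    return points
```

Note that an algorithm with the highest score among competitors still earns nothing in a configuration where it falls below the baseline.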

4 The UG^2 Dataset

As a basis for the evaluation regime described above in Sec. 3, we collected a new dataset which we call UG^2 (UAV, Glider, and Ground). It is available for download at the following site: http://www.ug2challenge.org/dataset18.html. The training and test datasets employed in the evaluation are composed of annotated frames from three different video collections (Fig. 4 presents example frames from each). The annotations provide bounding boxes establishing object regions and classes, which were manually annotated using the Vatic tool for video annotation [105]. For running classification experiments the objects were cropped from the frames in a square region of at least pixels (a common input size for many deep learning-based recognition models), using the annotations as a guide.
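One way to derive such a square crop from an annotated bounding box is sketched below. The minimum side length is left as a parameter because it is tied to the input size of the recognition models; the centering and border-clamping behavior are assumptions, since the text only states that the annotations were used as a guide.

```python
def square_crop_box(box, frame_size, min_side):
    """Expand a bounding box to a square of at least min_side pixels.

    box:        (x, y, width, height) of the annotated object
    frame_size: (frame_width, frame_height)
    Returns (x, y, side) for a square centered on the object where
    possible and shifted inward when it would cross the frame border.
    """
    x, y, w, h = box
    frame_w, frame_h = frame_size
    side = max(w, h, min_side)
    if side > min(frame_w, frame_h):
        raise ValueError("frame is too small for the requested square")
    center_x, center_y = x + w / 2, y + h / 2
    left = int(round(center_x - side / 2))
    top = int(round(center_y - side / 2))
    left = min(max(left, 0), frame_w - side)  # clamp to the frame
    top = min(max(top, 0), frame_h - side)
    return left, top, side
```

Keeping the crop square avoids distorting the object's aspect ratio when the region is later resized to a network's input resolution.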

Each annotation in the dataset indicates the position, scale, visibility, and super-class of an object in a video. The need for high-level classes (super-classes) arises from the challenge of performing fine-grained object recognition using aerial collections, which have a high variability in both object scale and rotation. These two factors make it difficult to differentiate some of the more fine-grained ImageNet categories. For example, while it may be easy to recognize a car from an aerial picture taken from hundreds (if not thousands) of feet above the ground, it might be impossible to determine whether that car is a taxi, a jeep or a sports car. Thus we defined super-classes that encompass multiple visually similar ImageNet synsets, as well as evaluation metrics that allow for a coarse-grained classification evaluation of such cases (see Sec. 3.2.1). The three different video collections consist of:

Fig. 4: Examples of images in the three UG^2 collections: (a) UAV Collection, (b) Glider Collection, (c) Ground Collection.

(1) UAV Video Collection: Composed of clips recorded from small UAVs in both rural and urban areas, the videos in this collection are open source content tagged with a Creative Commons license, obtained from YouTube. Because of the source, they have different video resolutions (from to ), objects of interest sizes (cropped objects with sizes ranging from to ), and frame rates (from FPS to FPS). This collection has distortions such as glare/lens flare, compression artifacts, occlusion, over/under exposure, camera shaking (present in some videos that use autopilot telemetry), sensor noise, motion blur, and fish eye lens distortion. Videos with problematic scene/weather conditions such as night/low light video, fog, cloudy conditions and occlusion due to snowfall are also included.

(2) Glider Video Collection: Consists of videos recorded by licensed pilots of fixed wing gliders in both rural and urban areas. The videos have frame rates ranging from FPS to FPS, objects of interest sizes ranging from to , and different types of compression such as MTS, MP4 and MOV. The videos mostly present imagery taken from thousands of feet above ground, further increasing the difficulty of object recognition. Additionally, the scenes contain artifacts such as motion blur, camera shaking, noise, occlusion (which in some cases is pervasive throughout the videos, showcasing parts of the glider that partially occlude the objects of interest), glare/lens flare, over/under exposure, interlacing, and fish eye lens distortion. This collection also contains videos with problematic weather conditions such as fog, clouds and occlusion due to rain.

(3) Ground Video Collection: In order to provide some ground-truth with respect to problematic image conditions, this collection contains videos captured at ground level with intentionally induced artifacts. These videos capture static objects (e.g., flower pots, buildings) at a wide range of distances (ft, ft, ft, ft, ft, ft, ft, and ft), and motion blur induced by an orbital shaker to generate horizontal movement at different rotations per minute (rpm, rpm, rpm, and rpm). Additionally, this collection includes videos under different weather conditions (sun, clouds, rain, snow) that could affect object recognition. We used a Sony Bloggie hand-held camera (with resolution and a frame rate of 60 FPS) and a GoPro Hero 4 (with resolution and a frame rate of 30 FPS), whose fisheye lens introduced further distortion. Furthermore, we provide an additional class of videos (resolution-chart) showcasing a inch checkerboard grid exhibiting all the aforementioned distances at all intervals of rotation. The motivation for including this additional class is to provide a reference for camera calibration used for ground data and to aid participants in finding the distortion measures of the cameras used.

TABLE I: Summary of the UG training dataset

Training Dataset. The training dataset is composed of videos with frames, representing ImageNet [1] classes extracted from annotated frames from the three different video collections. These classes are further categorized into super-classes encompassing visually similar ImageNet categories and two additional classes for pedestrian and resolution chart images. Furthermore, the dataset contains a subset of object-level annotated images and the videos are tagged to indicate problematic conditions. Table I summarizes the training dataset.

Testing Dataset. The testing dataset is composed of videos with frame-level annotations. Out of the annotated frames, disjoint frames were selected among the three different video collections, from which we extracted objects. These objects are further categorized into super-classes encompassing visually similar ImageNet categories. While most of the super-classes in the testing dataset overlap with those in the training dataset, there are some classes unique to each. Table II summarizes the testing dataset.

TABLE II: Summary of the UG testing dataset

5 Novel and Baseline Algorithms

Six competitive teams participated in the 2018 UG Workshop held at CVPR, each submitting a novel approach for image restoration and enhancement meant to address the evaluation tasks we described in Sec. 3. In addition, we assessed eight different classical and deep learning-based baseline algorithms from the literature.

5.1 Challenge Workshop Entries

The six participating teams were Honeywell ACST, Northwestern University, Texas A&M and Peking University, National Tsing Hua University, Johns Hopkins University, and Noblis. Each team had a unique take on the problem, with an approach designed for one or both of the evaluation tasks.

5.1.1 Camera and Conditions-Relevant Enhancements (CCRE)

Honeywell ACST’s algorithmic pipeline was motivated by a desire to closely target image enhancements in order to avoid the counter-productive results that the UG dataset has highlighted [5]. Fig. 5 illustrates their approach. Of the wide range of image enhancement techniques, there is a smaller subset of enhancements which may be useful for a particular image. To find this subset, the CCRE pipeline considers the intersection of camera-relevant enhancements with conditions-relevant enhancements. Examples of camera-relevant enhancements include de-interlacing, rolling shutter removal (both depending on the sensor hardware), and de-vignetting (for fisheye lenses). Example conditions-relevant enhancements include de-hazing (when imaging distant objects outdoors) and raindrop removal. To choose among the enhancements relevant to various environmental conditions and the camera hardware, CCRE makes use of defect-specific detectors.

This approach, however, requires a measure of manual tuning. For the evaluation task targeting human vision-based image quality assessment, manual inspection revealed severe interlacing in the glider set. Thus a simple interlacing detector was designed to separate each frame into two fields (composed of the even and odd image rows, respectively) and compute the horizontal shift needed to register the two. If that horizontal shift exceeded a threshold number of pixels, then the image was deemed interlaced, and de-interlacing was performed by linearly interpolating the rows to restore the full resolution of one of the fields.
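The interlacing check described above can be sketched as follows. This is a minimal illustration under our own assumptions, not Honeywell ACST's implementation: the function names, the brute-force shift search, the `max_shift` search range, and the `threshold_px` value are all hypothetical, and frames are assumed to be 2-D grayscale arrays.

```python
import numpy as np

def estimate_field_shift(frame, max_shift=8):
    """Horizontal shift (pixels) that best registers the odd field to the even field."""
    even = frame[0::2].astype(np.float64)
    odd = frame[1::2].astype(np.float64)
    rows = min(len(even), len(odd))
    even, odd = even[:rows], odd[:rows]
    best_shift, best_err = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        shifted = np.roll(odd, s, axis=1)
        # Compare only the central columns so wrapped-around pixels are ignored.
        diff = even[:, max_shift:-max_shift] - shifted[:, max_shift:-max_shift]
        err = np.mean(diff ** 2)
        if err < best_err:
            best_shift, best_err = s, err
    return best_shift

def deinterlace(frame):
    """Keep the even field and rebuild odd rows by linear interpolation."""
    out = frame.astype(np.float64).copy()
    h = out.shape[0]
    for r in range(1, h, 2):
        above = out[r - 1]
        below = out[r + 1] if r + 1 < h else out[r - 1]
        out[r] = 0.5 * (above + below)
    return out

def maybe_deinterlace(frame, threshold_px=2):
    """De-interlace only when the estimated field shift exceeds the threshold."""
    shift = estimate_field_shift(frame)
    if abs(shift) > threshold_px:
        return deinterlace(frame), True
    return frame.astype(np.float64), False
```

Calling `maybe_deinterlace(frame)` returns the (possibly corrected) frame together with a flag indicating whether interlacing was detected.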

For the evaluation task targeting automated object classification, de-interlacing is also performed with the expectation that the edge-type features learned by the VGG network will be impacted by jagged edges from interlacing artifacts. Beyond this, a camera and conditions assessment is partially automated using a file analysis heuristic to determine which of the collections a given video frame came from. While interlacing was the largest problem with the glider images, the ground and UAV collections were degraded by compression artifacts. Video frames from those collections were processed with the Fast Artifact Reduction CNN [51].

Fig. 5: The CCRE approach conditionally selects enhancements.

5.1.2 Multiple Artifact Removal CNN (MA-CNN)

The Northwestern team focused their attention on three major causes of artifacts in an image: (1) motion blur, (2) de-focus blur and (3) compression algorithms. They observed that in general, traditional algorithms address inverse problems in imaging via a two step procedure: first by applying proximal algorithms to enforce measurement constraints and then by applying natural image priors (sometimes denoisers) on the resulting output [106, 107]. Recent trends in inverse imaging algorithms have focused on developing a single algorithm or network to address multiple imaging artifacts [108]. These networks are alternately applied to denoise and deblur the image. Building on the above principle, the MA-CNN learning-based approach was developed to remove multiple artifacts in an image. A training dataset was created by introducing motion, de-focus and compression artifacts into images from ImageNet. The motion-blur was introduced by using a kernel with a fixed length and random direction for each of the images in the training dataset. The defocus blur was introduced by using a Gaussian kernel with a fixed standard variance . The parameters {} were tuned to create a perceptually improved result.
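To make the synthetic degradation step concrete, the following sketch builds a motion-blur kernel of fixed length and random direction and a Gaussian defocus kernel, then applies both to an image. It is an illustration under our own assumptions, not the Northwestern team's code: the helper names, the default `length` and `sigma` values, the FFT-based circular convolution, and the choice to chain both blurs on one image are hypothetical, and the compression step is omitted.

```python
import numpy as np

def motion_blur_kernel(length, angle_rad):
    """Line-shaped kernel of the given length and direction, normalized to sum to 1."""
    size = length if length % 2 == 1 else length + 1
    kernel = np.zeros((size, size))
    c = size // 2
    half = (length - 1) / 2
    for t in np.linspace(-half, half, 4 * length):
        x = int(round(c + t * np.cos(angle_rad)))
        y = int(round(c + t * np.sin(angle_rad)))
        kernel[y, x] += 1.0
    return kernel / kernel.sum()

def gaussian_kernel(sigma):
    """Isotropic Gaussian kernel truncated at three standard deviations."""
    radius = int(np.ceil(3 * sigma))
    ax = np.arange(-radius, radius + 1)
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()

def blur(image, kernel):
    """Circular convolution via the FFT (borders wrap around; fine for a sketch)."""
    kh, kw = kernel.shape
    padded = np.zeros_like(image, dtype=np.float64)
    padded[:kh, :kw] = kernel
    # Center the kernel on the origin so the output is not translated.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))

def degrade(image, rng, length=9, sigma=1.5):
    """Apply motion blur in a random direction, then Gaussian defocus blur."""
    angle = rng.uniform(0.0, np.pi)
    return blur(blur(image, motion_blur_kernel(length, angle)), gaussian_kernel(sigma))
```

Pairs of `(degrade(image, rng), image)` would then serve as input/target examples for a restoration network.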

MA-CNN is a fully convolutional neural network architecture with residual skip connections to generate the enhanced image. The network architecture is shown in Fig. 6. In order to get better visual quality results, a perceptual loss function that uses the first four convolutional layers of a pre-trained VGG-16 network is incorporated.

Fig. 6: Network architecture used for MA-CNN image enhancement.

By default, the output of the MA-CNN contains checkerboard artifacts. Since these checkerboard artifacts are periodic, they can be removed by suppressing the corresponding frequencies in the Fourier-domain. Moreover all images (of the same size) generated with the network have artifacts in a similar region in the Fourier domain. For images of differing sizes, the distance of the center of the artifact from the origin is proportional to its size.
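Suppressing periodic artifacts in the Fourier domain can be sketched as a simple notch filter. The code below is a minimal example under our own assumptions: the function name, the square notch shape, and the `radius` parameter are hypothetical, and the peak locations are assumed to be known in advance (for a pattern with a period of p pixels in an image of size n, the peak sits near frequency index n / p, which is why its distance from the origin grows with image size).

```python
import numpy as np

def suppress_frequencies(image, peaks, radius=1):
    """Zero small neighborhoods around the given (row, col) frequency indices.

    Each peak is removed together with its conjugate-symmetric partner so that
    the filtered image stays real-valued.
    """
    h, w = image.shape
    spectrum = np.fft.fft2(image)
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    for u, v in peaks:
        for pu, pv in ((u % h, v % w), (-u % h, -v % w)):
            # Distances measured with wrap-around, since the spectrum is periodic.
            dy = np.minimum(np.abs(yy - pu), h - np.abs(yy - pu))
            dx = np.minimum(np.abs(xx - pv), w - np.abs(xx - pv))
            spectrum[(dy <= radius) & (dx <= radius)] = 0
    return np.real(np.fft.ifft2(spectrum))
```

For a checkerboard with a period of two pixels in a 16×16 image, the single peak lies at index (8, 8), so `suppress_frequencies(image, [(8, 8)], radius=0)` removes it while leaving low-frequency content untouched.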

5.1.3 Cascaded Degradation Removal Modules (CDRM)

The Texas A&M team observed that independently removing any single type of degradation could in fact undermine performance in the object recognition evaluation task, since other degradations were not simultaneously considered and those artifacts might be amplified during this process. Consequently, they proposed a pipeline that consists of sequentially cascaded degradation removal modules to improve recognition. Further, they observed that different collections within the UG dataset had different degradation characteristics. As such, they proposed to first identify the incoming images as belonging to one of the three collections, and then deploy a specific processing model for each collection. The resulting pipeline, an ensemble of three strategies, is depicted in Fig. 7. In their model they adopted six different enhancement modules.

(1) Histogram Equalization balances the distribution of pixel intensities and increases the global contrast of images. To do this, Contrast Limited Adaptive Histogram Equalization (CLAHE) is adopted [109]. The image is partitioned into regions and the histogram of the intensities in each is mapped to a more balanced distribution. As the method is applied at the region level, it is more robust to locally strong over-/under-exposures and can preserve edges better.

(2) Given that removing blur effects is widely found to be helpful in fast-moving aerial cameras, and/or in low light filming conditions, Deblur GAN [110] is employed as an enhancement module in which, with adversarial training, the generator in the network is able to transform a blurred image to a visually sharper one.

(3) Recurrent Residual Net for Super-Resolution was previously proposed in [111]. Due to the large distance between objects and aerial cameras, low resolution is a bottleneck for recognizing most objects from UAV photos. This model is a recurrent residual convolutional neural network consisting of six layers and skip-connections.

(4) Deblocking Net [112] is an auto-encoder-based neural network with dilation convolutions to remove blocking effects in videos, which was fine-tuned using the VGG-19 perceptual loss function, after training using JPEG-compressed images. Since lossy video coding for on-board sensors introduced blocking effects in many frames, the adoption of the deblocking net was found to suppress visual artifacts.

(5) RED-Net [113] is trained to restore multiple mixed degradations, including noise and low resolution together. Images with various noise levels and scale levels are used for training. The network can improve the overall quality of images.

(6) HDR-Net [114] can further enhance the contrast of images to improve the quality for machine and human analysis. This network learns to produce a set of affine transformations in bilateral space to enhance the image while preserving sharp edges.

Fig. 7: The CDRM enhancement pipeline. If the glider set is detected, no action is taken (recognition is deemed to be good enough by default).

5.1.4 Tone Mapping Deep Image Prior (TM-DIP)

The main idea of the National Tsing Hua University team’s approach was to derive deep image priors for enhancing images that are captured from a specific scene with certain poor imaging conditions, such as the UG collections. They consider the setting that the high-quality counterparts of the poor-quality input images are unavailable, and hence it is not possible to collect pairwise input/output data for end-to-end supervised training to learn how to recover the sharp images from blurry ones.

The method of deep image prior presented by Ulyanov et al. [115] can reconstruct images without using information from ground-truth sharp images. However, it usually takes several minutes to produce a prior image by training an individual network for each image. Thus a new method was designed to replace the per-image prior model of [115] by a generic prior network. This idea is feasible since images taken in the same setting, e.g., the UG videos, often share similar features. It is not necessary to have a distinct prior model for each image. One can learn a generic prior network that takes every image as the input and generates its corresponding prior as the output.

At training time, the method from [115] is used to generate image pairs for training a generic prior network, where is an original input image and is its corresponding prior image. The generic prior network adopts an encoder-decoder architecture with skip connections as in [116]. At inference time, given a new image, its corresponding prior image is efficiently obtained from the learned generic prior network, with tone mapping then applied to enhance the details.

It was observed that the prior images obtained by the learned generic prior network usually preserve the significant structure of the input images but exhibit fewer details. This observation, therefore, led to a different line of thought on the image enhancement problem. By comparing the prior image with the original input image, details for enhancement may be extracted. Thus, the tone mapping technique presented in [117] was used to enhance the details:

(1)

where is the input image, refers to the prior image, the ratio can be considered as the details, and is a factor for adjusting the degree of detail-enhancement. With the tone-mapping function in Eq. (1), the local details are detached from the input image, and the factor is subsequently adjusted to obtain an enhanced image .
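As an illustration of this kind of ratio-based detail enhancement, the sketch below multiplies the prior image by the detail ratio raised to an adjustable power. The exact form of Eq. (1) is not reproduced here, so this expression is an assumption on our part, as are the function name, the symbol `gamma` for the detail factor, the `eps` stabilizer, and the [0, 1] intensity range.

```python
import numpy as np

def enhance_details(image, prior, gamma=1.5, eps=1e-6):
    """Amplify the details of `image` relative to its smoother `prior`.

    Assumed form: enhanced = prior * (image / prior) ** gamma.
    With gamma = 1 the input is returned (up to eps); gamma > 1 exaggerates
    whatever differs between the input and the prior.
    """
    image = image.astype(np.float64)
    prior = prior.astype(np.float64)
    detail = (image + eps) / (prior + eps)  # the ratio treated as "details"
    return np.clip(prior * detail ** gamma, 0.0, 1.0)
```

Under this assumed form, regions where the prior already matches the input are left unchanged regardless of `gamma`, so only the structure that the prior fails to capture is boosted.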

5.1.5 Satellite Images Super-Resolution (SSR)

The team from Johns Hopkins University proposed a neural network-based approach to apply super-resolution on images. They trained their model on satellite imagery, which has an abundance of detailed features. Their network is fully convolutional, and takes as input an image of any resolution and outputs an image that is exactly double the original input in width and height.

The network is constrained to fixed-size pixel patches of the image with an “apron” of pixels for overlap. This results in an output whose outer pixels are ignored, as they are the apron; they mirror the edge to “pad” the image. These segments are then stitched together to form the final image. The network consists of five convolutional layers; see Table III for details.
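The tile-and-stitch procedure can be sketched as follows. This is a minimal example under our own assumptions rather than the Johns Hopkins implementation: the function names and the `tile` and `apron` sizes are hypothetical, images are assumed to be 2-D arrays, and a nearest-neighbor upscaler stands in for the trained network.

```python
import numpy as np

def upscale_tiled(image, model, tile=32, apron=4, scale=2):
    """Upscale `image` tile by tile, discarding the apron of each output tile."""
    h, w = image.shape
    # Mirror the borders so that edge tiles also have a full apron.
    padded = np.pad(image, apron, mode="reflect")
    out = np.zeros((h * scale, w * scale), dtype=np.float64)
    a = apron * scale
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            th, tw = min(tile, h - y), min(tile, w - x)
            patch = padded[y:y + th + 2 * apron, x:x + tw + 2 * apron]
            up = model(patch)
            # Keep only the interior of the upscaled patch; the apron is ignored.
            out[y * scale:(y + th) * scale, x * scale:(x + tw) * scale] = \
                up[a:a + th * scale, a:a + tw * scale]
    return out

def nearest_x2(patch):
    """Stand-in for the network: doubles width and height by pixel repetition."""
    return np.repeat(np.repeat(patch, 2, axis=0), 2, axis=1)
```

With a real network in place of `nearest_x2`, the apron gives each tile enough surrounding context that the seams between stitched tiles are less likely to be visible.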

TABLE III: The SSR network layers.

Most of the network’s layers contain kernels, and hence are just convolutionalized fully connected layers. This network structure is appropriate for a super-resolution task because it can be equated to a regression problem in which a fixed-dimension input vector is mapped to a fixed-dimension output vector. The first convolutional layer is necessary to maintain the spatial relationships of the visual features through the kernel.

The SpaceNet dataset [118], which is derived from satellite-based images, is used to train this network. Images were downsampled and paired with the originals. Training took place for epochs using an L2 + L1 combined loss and the Adam optimizer in Keras/TensorFlow [103].

5.1.6 Style-Transfer Enhancement Using GANs (ST-GAN)

Noblis attempted a style-transfer approach for improving the quality of the UG2 imagery. Since the classification networks used in the UG2 evaluation protocol were all trained on ImageNet, a CycleGAN [119] variant was trained to translate between the UAV and drone collections and ImageNet, using LSGAN [120] losses. The architecture was based on the original CycleGAN paper, with modified generators adding skip connections between same spatial resolution convolutional layers on both sides of the residual blocks (in essence a U-Net [116] style network), which appeared to improve retention of details in the output images. The UG2 to ImageNet generator was also made to perform upscaling (by adding two strided convolutional layers after the first convolutional layer), and the ImageNet to UG2 generator was made to perform downscaling (by adding two stride convolutional layers after the first convolutional layer). The discriminators were left unmodified. Networks were trained using patches selected from the UG2 images, and ImageNet images cropped and resized to . UG2 patches were selected by randomly sampling regions around the ground-truth annotation bounding boxes, mainly to avoid accidentally sampling flat-colored patches.

However, several problems were initially encountered when optimizing the network. Optimization would fail outright, unless it employed some form of normalization. However, it was not possible to employ the typical batch-norm [121], due to small batch sizes. Instead, instance normalization [122] was initially used, which proved reasonably effective. However, even after avoiding immediate optimization failure, training a network with just the cycle-consistency and GAN losses can lead to mode failures such as both generators performing color inversion within a few thousand iterations. Adding the identity mapping losses (i.e., loss terms for , and ) discussed in the original CycleGAN paper proved effective in avoiding these kinds of failures.

Since the UG evaluation protocol specifies the enhancement of full video frames, either a larger input to the generator must be used (which seemed feasible considering a fully-convolutional architecture), or the input image must be divided into tiles. In either case, instance norm led to poor results: on tiles there was a clear boundary between tiles due to different local statistics, and on full frames the output images had a skewed color distribution due to a difference in statistics over different sized inputs. To combat this, instance norm was replaced with an operation that performed normalization independently down the channels of each pixel. This stabilized convergence, and did not cause problems when tiling out large images.
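The difference between the two normalizations can be seen in a small sketch. The per-pixel variant below is one possible reading of "normalization independently down the channels of each pixel" (zero mean and unit variance across channels at every spatial location); the function names and this exact formulation are our assumptions, not necessarily what Noblis implemented.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """x has shape (C, H, W); statistics are computed per channel over H and W."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def per_pixel_channel_norm(x, eps=1e-5):
    """x has shape (C, H, W); statistics are computed per pixel over the C channels."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```

Because the per-pixel statistics depend only on the channel vector at each location, normalizing a tile gives the same values as cropping the normalized full frame, whereas instance norm depends on the spatial extent of its input and therefore differs between tiles and full frames.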

5.2 Baseline Algorithms

Here we describe a number of other algorithms we used as a pre-processing step to the recognition models. These serve as canonical references or baselines against which the algorithms in Sec. 5.1 were tested. We used both classical methods and state-of-the-art deep learning-based methods for image interpolation [123], super-resolution [20, 60], and deblurring [124, 125].

Classical Methods. For image enhancement, we used three different interpolation methods (bilinear, bicubic and nearest neighbor) [123] and a single restoration algorithm (blind deconvolution [124]). The interpolation algorithms attempt to obtain a high resolution image by up-sampling the source low-resolution image and by providing the best approximation of a pixel’s color and intensity values depending on the nearby pixels. Since they do not need any prior training, they can be directly applied to any image. Nearest neighbor interpolation assigns each output pixel the value of the closest source pixel. Bilinear interpolation instead takes a weighted average over the nearest 2×2 neighborhood of source pixels, and bicubic interpolation extends this to a 4×4 neighborhood. Different from image enhancement, in image restoration the degradation, which is the product of motion or depth variation from the object or the camera, is modelled. The blind deconvolution algorithm can be used effectively when no information about the degradation (blur and noise) is known [126]. The algorithm restores the image and the point-spread function (PSF) simultaneously. We used MATLAB’s blind deconvolution algorithm, which deconvolves the image using the maximum likelihood algorithm, with a array of s as the initial PSF.
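To illustrate how the interpolation baselines differ, the following sketch implements nearest neighbor and bilinear up-sampling for 2-D arrays with an integer scale factor. It is a simplified illustration, not the implementation used in our experiments: the function names and the pixel-center coordinate convention are assumptions.

```python
import numpy as np

def resize_nearest(image, scale):
    """Each output pixel copies the value of the closest source pixel."""
    h, w = image.shape
    rows = np.arange(h * scale) // scale
    cols = np.arange(w * scale) // scale
    return image[np.ix_(rows, cols)]

def resize_bilinear(image, scale):
    """Each output pixel is a weighted average of its 2x2 source neighborhood."""
    h, w = image.shape
    # Map output pixel centers back to source coordinates, clamped at the borders.
    ys = np.clip((np.arange(h * scale) + 0.5) / scale - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * scale) + 0.5) / scale - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = image[np.ix_(y0, x0)] * (1 - wx) + image[np.ix_(y0, x1)] * wx
    bottom = image[np.ix_(y1, x0)] * (1 - wx) + image[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bottom * wy
```

Nearest neighbor output contains only values already present in the source, which yields blocky edges, while bilinear output blends neighboring values and therefore looks smoother but slightly blurred.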

Deep Learning-Based Methods. With respect to state-of-the-art deep learning-based super-resolution algorithms, we tested the Super-Resolution Convolutional Neural Network (SRCNN) [20] and Very Deep Super Resolution (VDSR) [60]. The SRCNN method employs a feed-forward deep CNN to learn an end-to-end mapping between low resolution and high resolution images. The network was trained on million “sub-images" generated from images of the ILSVRC 2013 ImageNet detection training partition [1]. The VDSR algorithm [60] outperforms SRCNN by employing a deeper CNN inspired by the VGG architecture [22] and decreases training iterations and time by employing residual learning with a very high learning rate for faster convergence. Unlike SRCNN, the network is capable of handling different scale factors.

With respect to deep learning-based image restoration algorithms, we tested Deep Dynamic scene Deblurring [125], which was designed to address camera shake blur. However, the results presented in [125] indicated that this method can obtain good results for other types of blur. The algorithm employs a CNN that was trained with video frames containing synthesized motion blur such that it receives a stack of neighboring frames and returns a deblurred frame. The algorithm allows for three types of frame-to-frame alignment: no alignment, optical flow alignment, and homography alignment. For our experiments we used optical flow alignment, which was reported to have the best performance with this algorithm. We had originally evaluated an additional video deblurring algorithm proposed by Su et al. [16]. However, this algorithm employs information across multiple consecutive frames to perform its deblurring operation. Given that the training and testing partitions of the UG dataset consist of disjoint video frames, we omitted this method to provide a fair comparison.

With respect to image enhancement specifically to improve classification, we tested the recently released algorithm by Sharma et al. [83]. This approach learns a dynamic image enhancement network with the overall goal to improve classification, but not necessarily human perception of the image. The proposed architecture enhances image features selectively, in such a way that the enhanced features provide a valuable improvement for a classification task. High quality (i.e., free of visual artifacts) images are used to train a CNN to learn a configuration of enhancement filters that can be applied to an input image to yield an enhanced version that provides better classification performance.

(a) UAV Collection
(b) Glider Collection
(c) Ground Collection
Fig. 8: Distribution of LPIPS similarity between original and enhanced image pairs and the human perceived improvement / deterioration for each of the collections. Images human raters considered as having a high level of improvement tended to also have low LPIPS scores, while images with higher LPIPS scores tended to be rated negatively by human observers.

6 Results & Analysis

In the following analysis, we review the results that came out of the UG Workshop held at CVPR 2018, and discuss additional results from the slate of baseline algorithms.

6.1 Enhancement to Facilitate Manual Inspection (UG Evaluation Task 1)

There is a recent trend to use deep features, as measured by the difference in activations from the higher convolutional layers of a pre-trained network for the original and reconstructed images, as a perceptual metric, the motivation being that deep features somewhat mimic human perception. Zhang et al. [21] evaluate the usability of these deep features as a measure of human perception. Their goal is to find a perceptual distance metric that resembles human judgment. The outcome of their work is the LPIPS metric, which measures the perceptual similarity between two images ranging from , meaning exactly similar, to , which is equivalent to two completely dissimilar images. Here we compare this metric (i.e., similarity between the original image and the output of the evaluated enhancement algorithms) directly to human perception, which we argue is the better reference point for such assessments.

We used the most current version (v.0.1) of the LPIPS metric with a pre-trained, linearly calibrated AlexNet model. As can be observed in Fig. 8, the four novel algorithms that were submitted by participants for the first UG evaluation task have very heterogeneous effects on different images of the dataset, with LPIPS scores ranging all the way from (no perceptual dissimilarity) to (moderate dissimilarity). This effect is accentuated for the images in the UAV Collection (Fig. 8a), which yields more variance in LPIPS scores, whereas LPIPS scores for the remaining two collections (Fig. 8b, 8c) remained between and for most of these algorithms. We observed a similar effect for some of our baseline algorithms (Fig. 9a), particularly for Blind Deconvolution (BD), which sharpened images, but also amplified artifacts that were already present.

(a) Baseline Algs.
(b) Top Participant vs. Baseline Algs.
Fig. 9: Distribution of LPIPS similarity between original and enhanced image pairs of the test dataset and the human perceived improvement / deterioration for all of the collections within the UG dataset.

However, we observed that for both the participant and baseline algorithms, the images human raters considered as having a high level of improvement tended to also have low LPIPS scores (usually between and ), while images with higher LPIPS scores tended to be rated negatively by human observers (see Figs. 8 and 9a). A similar behaviour was observed by Zu et al. [127]. They suggest that high LPIPS scores might indicate the presence of unnatural looking images. Contrasting the results of the two best participant and baseline algorithms (Fig. 9b), we observed that for the baseline algorithms and the CCRE approach the LPIPS and human rating distributions were more tightly grouped than MA-CNN. Nevertheless, their changes to images tended to be considered small by both human raters and the LPIPS metric, with a user rating closer to (no change) and LPIPS scores between and . In contrast, changes induced by the MA-CNN method reached extremes of and for LPIPS score and human rating respectively, flagging the presence of very noticeable, but in some cases detrimental, changes. This is a further constraint of the LPIPS metric.

(a) Participant Algs.
(b) Baseline Algs.
(c) Top Participant vs. Top Baseline Algs.
Fig. 10: Comparison of perceived visual improvement for all collections after applying enhancement algorithms.

It is important to note that while we calculated the mean user rating of all the workshop participant submissions and baseline algorithms, it was not possible to obtain the LPIPS scores for any of the super-resolution approaches. This would have required us to down-sample enhanced images to be of the same size as that of the original images, which would have negated the improvement of such methods.

Focusing just on the image improvement / deterioration as perceived by human raters, we can turn to Fig. 10a for the performance of all algorithms, including the super-resolution approaches, submitted by each team. It is important to note that while most of the algorithms tended to improve the visual quality of the images they were presented with, a large fraction of the images they enhanced tended to have an average score between and . This means that the human raters were not able to find any significant difference between the enhanced and original image pairs.

The best performing algorithm submitted for this task, CCRE, was able to improve the visual quality of images, even though the algorithm did not appear to perform any significant changes to most of the images ( images had a rating between and ). The enhancement applied was considered a subtle improvement in most scenarios: images had a score between and , with the remaining having a higher improvement score between and . However, only of the modified images were considered to degrade the image quality, and even then they had a rating of between and , which means that the degradation was very small.

As mentioned previously, the visual changes generated by the runner-up enhancement algorithm MA-CNN were more explicit than those present in CCRE. While the number of images that were considered to be improved was smaller ( improved images), of them were between the range of and , indicating a good measure of higher visual quality. Nevertheless, the sharp changes introduced by this algorithm also seemed to increase the perceived degradation on a larger portion of the images, with almost having a score between and (thus indicating a significant deterioration of the image quality).

With respect to the baseline algorithms, we observed a less dramatic perception of quality degradation. Given that most of the algorithms tested were focused on enhancing the image resolution (by performing image interpolation or super-resolution), they were more prone to perform very subtle changes on the structure of the image. This is reflected in Fig. 10b. Fig. 10c shows a side by side comparison of the two best baselines (VDSR and Nearest Neighbor interpolation) and the two best performing participant submissions.

Fig. 11: Classification rates at rank 5 for the original, un-processed, frames for each collection in the training and testing datasets.

6.2 Evaluation of Object Recognition Performance (UG Evaluation Task 2)

TABLE IV: Classification scores for the un-altered images of the testing dataset. The evaluated algorithms were expected to improve these scores.

For this evaluation task the participants were expected to provide enhancement techniques catering to machine vision rather than human perception. Table IV and Fig. 11 depict the baseline classification results for the UG training and testing datasets, without any restoration or enhancement algorithm applied, at rank 5. Given the very poor quality of its videos, the UAV Collection proved to be the most challenging for all networks in terms of object classification, leading to the lowest classification performance out of the three collections. While the Glider Collection shares similar problematic conditions with the UAV Collection, the images in this collection lead to a higher classification rate than those in the UAV Collection in terms of identifying at least one correctly classified synset class (metric M1). This improvement might be caused by the limited degree of movement of the gliders, which keeps the displacement between frames more stable over time, as well as by a higher recording quality (the camera weight limitations present in small UAVs are not a limiting factor for this collection’s videos). The controlled Ground Collection yielded the highest classification rates, which, in an absolute sense, are still low (the highest classification rate being for metric M1 and for metric M2 for the testing dataset; see Table IV). The participants of the UG workshop were expected to develop algorithms to improve upon the baseline scores.
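The rank-5 check behind metric M1, as characterized above (at least one correctly classified synset class of the object's super-class among the top 5 predictions), can be sketched as follows. This is only an illustration: the formal metric definitions are given in Sec. 3.2.1, and the function names, data layout, and synset identifiers used here are hypothetical.

```python
def top5_hits(top5_synsets, superclass_synsets):
    """Number of top-5 predictions that belong to the object's super-class."""
    return len(set(top5_synsets) & set(superclass_synsets))

def m1_rate(predictions, labels, superclasses):
    """Fraction of objects with at least one super-class synset in the top 5.

    predictions:  list of top-5 synset lists, one per object
    labels:       list of super-class names, one per object
    superclasses: dict mapping a super-class name to its set of synsets
    """
    hits = sum(
        1
        for top5, label in zip(predictions, labels)
        if top5_hits(top5, superclasses[label]) >= 1
    )
    return hits / len(labels)
```

The helper `top5_hits` also exposes how many members of the super-class appear in the top 5, which is the quantity that changes when an enhancement makes an object a better representative of its super-class.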

While there are some correlations between the improvement of visual quality as perceived by humans and high level tasks such as object classification or face recognition performed by networks, research by Sharma et al. [83] suggests that image enhancement focused on improving image characteristics valuable for object recognition can lead to an increase in classification performance. Thus we tried such a technique as an initial experiment and compared it to the baseline results. Fig. 12 shows the performance of the five enhancement filters proposed by Sharma et al. on our dataset. It is important to note that said filters were the unaltered filters from Sharma et al., which were trained on good quality images. This is because that work was focused on improving the classification performance of images with few existing perturbations. As such, the effect on improving highly corrupted images is much different from that obtained on a standard dataset of images crawled from the web. The results in Fig. 12 establish that even existing deep learning networks designed for this task cannot achieve good classification rates for UG due to the domain shift in training.

(a)
(b)
(c)
Fig. 12: Comparison of classification rates at rank 5 for each collection after applying classification driven image enhancement algorithms by Sharma et al. [83]. Markers in red indicate results on original images.

As can be observed in Table V and Fig. 13, the four novel algorithms submitted by participants for the second UG evaluation task excelled in the processing of certain collections while falling short in others. Most of the submitted algorithms were able to improve the classification performance of the images in the Ground Collection, but they struggled in improving the classification for the aerial collections, whose scenes tend to have a higher degree of variability than those present in the Ground Collection. Only the CCRE algorithm was able to improve the performance of one of the metrics for the UAV Collection (the M2 metric for the ResNet network, with a improvement over the baseline). The MA-CNN algorithm was able to improve two of the Glider Collection metrics (the M2 metric for the VGG16 and VGG19 networks with and improvement respectively over the baselines). For the most part algorithms tended to improve the metrics for the Ground Collection, with the highest classification improvement being provided by the CDRM method, with an improvement of and over the baselines for the Inception M1 and M2 metrics.

TABLE V: Highest classification rates for the evaluated participant enhancement algorithms that improved the classification performance of the un-altered test images. Improvement was not achieved in all cases.

Further along these lines, while both metrics saw some improvement, the M2 metric benefited the most from these enhancement algorithms. This behavior is more pronounced when examined in the context of classic vs. state-of-the-art algorithms (Fig. 14). While the baseline enhancement algorithms had moderate improvements in both metrics, the participant algorithms seemed to favor the M2 metric over the M1 metric. For example, on the Ground Collection, the highest improvement for the VGG19 network in M1 was smaller than the improvement for the same network in M2. This leads us to conclude that the effect these algorithms have on automatic object recognition is that of increasing within-class classification certainty. In other words, they make an object belonging to a super-class a better representation of that class's features, such that the networks are able to detect more members of that class in their top 5 predictions.
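The distinction between the two metrics can be illustrated with a minimal sketch. The exact definitions of M1 and M2 are those given earlier in the article; the code below only assumes that M1 counts an image as correct when at least one of its top 5 predictions belongs to the ground-truth super-class, and it approximates the stricter M2 with a hypothetical `min_hits` threshold. The function name `rank5_rate`, the label names, and the toy predictions are illustrative assumptions and not part of the evaluation code.

```python
from typing import Sequence, Set


def rank5_rate(top5_preds: Sequence[Sequence[str]],
               superclass_labels: Sequence[Set[str]],
               min_hits: int = 1) -> float:
    """Fraction of images whose top 5 predictions contain at least
    `min_hits` labels from the image's ground-truth super-class."""
    if len(top5_preds) != len(superclass_labels):
        raise ValueError("one label set is required per image")
    if not top5_preds:
        return 0.0
    correct = 0
    for preds, labels in zip(top5_preds, superclass_labels):
        hits = sum(1 for p in preds[:5] if p in labels)
        if hits >= min_hits:
            correct += 1
    return correct / len(top5_preds)


# Toy data (hypothetical): four frames that all show a vehicle.
vehicle = {"sports_car", "minivan", "pickup", "jeep"}
labels = [vehicle, vehicle, vehicle, vehicle]
baseline_top5 = [
    ["sports_car", "lamp", "dog", "tree", "boat"],   # 1 hit
    ["lamp", "dog", "tree", "boat", "kite"],         # 0 hits
    ["minivan", "pickup", "dog", "tree", "boat"],    # 2 hits
    ["lamp", "dog", "tree", "boat", "kite"],         # 0 hits
]
enhanced_top5 = [
    ["sports_car", "jeep", "dog", "tree", "boat"],   # 2 hits
    ["lamp", "dog", "tree", "boat", "kite"],         # 0 hits
    ["minivan", "pickup", "jeep", "tree", "boat"],   # 3 hits
    ["lamp", "dog", "tree", "boat", "kite"],         # 0 hits
]

# Improvement over the baseline under the lenient (M1-like) criterion.
m1_gain = rank5_rate(enhanced_top5, labels, 1) - rank5_rate(baseline_top5, labels, 1)
# Improvement under a stricter (M2-like, assumed) criterion of two hits.
strict_gain = rank5_rate(enhanced_top5, labels, 2) - rank5_rate(baseline_top5, labels, 2)
print(m1_gain, strict_gain)
```

In this toy case the enhancement does not make any previously unrecognized frame recognizable, so the lenient rate is unchanged, but it does place more members of the super-class in the top 5 for frames that were already partly recognized, so the stricter rate rises. This mirrors the within-class certainty effect described above.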

Fig. 13: Comparison of classification rates at rank 5 for each collection (panels (a)-(c)) after applying the four algorithms submitted by teams for this task.
Fig. 14: Performance comparison between the two best performing participant algorithms and two best baseline enhancement algorithms (panels (a)-(c)).

7 Discussion

The area of computational photography has a long-standing history of generating algorithms to improve image quality. However, the metrics those algorithms tend to optimize do not correspond to human perception and, as we found, this makes them unreliable for the task of object recognition. Given this, how do we design an enhancement and recognition pipeline that simultaneously corrects imaging artifacts and recognizes objects of interest? Skeptics may argue that re-training or fine-tuning existing recognition networks can solve the problem. As promising as this may sound, training a network involves a large cost in terms of time and resources, and if we were to train these networks anew for each task or dataset, we would be ignoring other paths to solving this problem. The UG^2 dataset paves the way for this new research through the introduction of a benchmark that is well matched to the problem area. The challenge is to jointly optimize two seemingly divergent but relevant tasks: (1) image enhancement and restoration and (2) object recognition. The first iteration of the challenge workshop making use of this dataset saw participation of teams from around the globe and introduced six unique algorithms to bridge the gap between computational photography and recognition, which we have described in this article. As noted by some participants and in accordance with our initial results, the problem is still not solved: improving image quality using existing techniques does not necessarily solve the recognition problem.

The results of our experiments led to some surprises. Even though the restoration and enhancement algorithms tended to improve the classification results for the diverse imagery included in our dataset, no approach was able to uniformly improve the results for all of the candidate networks. Moreover, in some cases, performance degraded after image pre-processing, particularly for frames with higher amounts of image aberrations. This highlights the often single-focus nature of image enhancement algorithms, which tend to specialize in removing a specific kind of artifact from an otherwise good-quality image, a condition that might not always hold. Some of the algorithmic advancements (e.g., MA-CNN and CDRM) developed as a product of this challenge seek to address the problem by incorporating techniques such as deblurring, denoising, deblocking, and super-resolution into a single pre-processing pipeline. This practice showed that while some of these techniques, applied individually, might be detrimental to visual quality or visual recognition, when applied in conjunction with other enhancement techniques their effect turned out to be beneficial for both objectives.
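As a rough illustration of this idea, the sketch below chains several enhancement stages into one pre-processing function and scores every subset of stages, so that the effect of a stage in isolation can be compared with its effect in combination. It is not the MA-CNN or CDRM implementation: the three stages are deliberately simple stand-ins (a row mean filter, a contrast stretch, and nearest-neighbor upsampling), images are plain nested lists, and the scoring function is a hypothetical placeholder where a real study would plug in a recognition metric.

```python
from functools import reduce
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]
Stage = Callable[[Image], Image]


def compose(stages: List[Stage]) -> Stage:
    """Chain stages left to right into a single pre-processing function."""
    def run(img: Image) -> Image:
        return reduce(lambda acc, stage: stage(acc), stages, img)
    return run


def smooth_rows(img: Image) -> Image:
    """3-tap horizontal mean filter with edge replication (denoiser stand-in)."""
    out = []
    for row in img:
        padded = [row[0]] + list(row) + [row[-1]]
        out.append([(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
                    for i in range(len(row))])
    return out


def stretch_contrast(img: Image) -> Image:
    """Rescale intensities to [0, 1] (contrast enhancement stand-in)."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        return [list(row) for row in img]
    return [[(v - lo) / (hi - lo) for v in row] for row in img]


def upsample2x(img: Image) -> Image:
    """Nearest-neighbor 2x upsampling (super-resolution stand-in)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out


def score_subsets(stages: Dict[str, Stage], img: Image,
                  score: Callable[[Image], float]) -> Dict[Tuple[str, ...], float]:
    """Score every non-empty subset of stages, applied in dictionary order."""
    names = list(stages)
    results = {}
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            out = compose([stages[n] for n in subset])(img)
            results[subset] = score(out)
    return results


stages = {"smooth": smooth_rows, "stretch": stretch_contrast, "upsample": upsample2x}
frame = [[0.0, 1.0], [1.0, 0.0]]
enhanced = compose(list(stages.values()))(frame)

# Hypothetical score: dynamic range of the output image.
scores = score_subsets(stages, frame,
                       lambda im: max(map(max, im)) - min(map(min, im)))
```

In this toy example the smoothing stage on its own reduces the dynamic range of the frame, while smoothing followed by the contrast stretch restores it, which is a small-scale analogue of a stage that hurts in isolation but is harmless or helpful inside a longer pipeline.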

We also found out that image quality is a subjective assessment and better left to humans who are physiologically tuned to notice higher variations and artifacts in images as a result of evolution. Based on this observation, we developed a psychophysics-based evaluation regime for human assessment and a realistic set of quantitative measures for object recognition performance. The code for conducting such studies will be made publicly available following the publication of this article.

Inspired by the success of the UG^2 challenge workshop held at CVPR 2018, we intend to hold subsequent iterations with associated competitions based on the UG^2 dataset. These workshops will be similar in spirit to the PASCAL VOC and ImageNet workshops that have been held over the years and will feature new tasks, extending the reach of UG^2 beyond the realm of image quality assessment and object classification.

Acknowledgments

Funding for this work was provided under IARPA contract #2016-16070500002, and NSF DGE #1313583. This workshop is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA). The views and conclusions contained herein are those of the organizers and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

Hardware support was generously provided by the NVIDIA Corporation, and made available by the National Science Foundation (NSF) through grant #CNS-1629914. We thank Drs. Adam Czajka and Christopher Boehnen for conducting an impartial judgment for the challenge tracks to determine the winners, Mr. Vivek Sharma for executing his code on our data and providing us with the results, Kelly Malecki for her tireless effort in annotating the test dataset for the UAV and Glider collections, and Sandipan Banerjee for assistance with data collection.

References

  • [1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015.
  • [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
  • [3] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in IEEE ICCV, 2015.
  • [4] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in IEEE CVPR, 2014.
  • [5] R. G. Vidal, S. Banerjee, K. Grm, V. Struc, and W. J. Scheirer, “UG^2: A video benchmark for assessing the impact of image restoration and enhancement on automatic visual recognition,” in IEEE WACV, 2018.
  • [6] S. Dodge and L. Karam, “Understanding how image quality affects deep neural networks,” in QoMEX, 2016.
  • [7] G. B. P. da Costa, W. A. Contato, T. S. Nazaré, J. do E. S. Batista Neto, and M. Ponti, “An empirical study on the effects of different types of noise in image classification tasks,” CoRR, vol. abs/1609.02781, 2016.
  • [8] S. Dodge and L. Karam, “A study and comparison of human and deep learning recognition performance under visual distortions,” in ICCCN, 2017.
  • [9] H. Hosseini, B. Xiao, and R. Poovendran, “Google’s cloud vision api is not robust to noise,” in IEEE ICMLA, 2017.
  • [10] J. Yim and K. Sohn, “Enhancing the performance of convolutional neural networks on quality degraded datasets,” CoRR, vol. abs/1710.06805, 2017.
  • [11] B. RichardWebster, S. E. Anthony, and W. J. Scheirer, “Psyphy: A psychophysics driven evaluation framework for visual recognition,” IEEE T-PAMI, 2018, to Appear.
  • [12] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International Conference on Curves and Surfaces, 2010.
  • [13] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in BMVC, 2012.
  • [14] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in ACCV, 2014.
  • [15] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in IEEE CVPR, 2015.
  • [16] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang, “Deep video deblurring,” CoRR, vol. abs/1611.08387, 2016. [Online]. Available: http://arxiv.org/abs/1611.08387
  • [17] S. Diamond, V. Sitzmann, S. P. Boyd, G. Wetzstein, and F. Heide, “Dirty pixels: Optimizing image classification architectures for raw sensor data,” CoRR, vol. abs/1701.06487, 2017.
  • [18] M. Haris, G. Shakhnarovich, and N. Ukita, “Task-driven super resolution: Object detection in low-resolution images,” CoRR, vol. abs/1803.11316, 2018.
  • [19] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch, “Enhancenet: Single image super-resolution through automated texture synthesis,” CoRR, vol. abs/1612.07919, 2016.
  • [20] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in ECCV, 2014.
  • [21] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in IEEE CVPR, 2018.
  • [22] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [23] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE CVPR, 2016.
  • [24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE CVPR, 2016.
  • [25] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, “Understanding and evaluating blind deconvolution algorithms,” in IEEE CVPR, 2009.
  • [26] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in IEEE CVPR Workshops, 2017.
  • [27] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, “A statistical evaluation of recent full reference image quality assessment algorithms,” IEEE T-IP, vol. 15, no. 11, pp. 3440–3451, 2006.
  • [28] C.-Y. Yang, C. Ma, and M.-H. Yang, “Single-image super-resolution: A benchmark,” in ECCV, 2014.
  • [29] R. B. Fisher, “The PETS04 surveillance ground-truth data sets,” in IEEE PETS Workshop, 2004.
  • [30] M. Grgic, K. Delac, and S. Grgic, “SCface–surveillance cameras face database,” Multimedia Tools and Applications, vol. 51, no. 3, pp. 863–879, 2011.
  • [31] J. Shao, C. C. Loy, and X. Wang, “Scene-independent group profiling in crowd,” in IEEE CVPR, 2014.
  • [32] X. Zhu, C. C. Loy, and S. Gong, “Video synopsis by heterogeneous multi-source correlation,” in IEEE ICCV, 2013.
  • [33] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. C. Chen, J. T. Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury, and M. Desai, “A large-scale benchmark dataset for event recognition in surveillance video,” in IEEE CVPR, 2011.
  • [34] P. Zhu, L. Wen, X. Bian, L. Haibin, and Q. Hu, “Vision meets drones: A challenge,” arXiv preprint arXiv:1804.07437, 2018.
  • [35] “UCF Aerial Action data set,” http://crcv.ucf.edu/data/UCF_Aerial_Action.php.
  • [36] “UCF-ARG data set,” http://crcv.ucf.edu/data/UCF-ARG.php.
  • [37] M. Mueller, N. Smith, and B. Ghanem, “A benchmark and simulator for UAV tracking,” in ECCV, 2016.
  • [38] B. Yao, X. Yang, and S.-C. Zhu, “Introduction to a large-scale general purpose ground truth database: Methodology, annotation tool and benchmarks,” in Intl. Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2007.
  • [39] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, “Efficient marginal likelihood optimization in blind deconvolution,” in IEEE CVPR, 2011.
  • [40] A. Levin, R. Fergus, F. Durand, and W. T. Freeman, “Image and depth from a conventional camera with a coded aperture,” ACM Transactions on Graphics (TOG), vol. 26, no. 3, 2007.
  • [41] N. Joshi, C. L. Zitnick, R. Szeliski, and D. J. Kriegman, “Image deblurring and denoising using color priors,” in IEEE CVPR, 2009.
  • [42] A. Levin and B. Nadler, “Natural image denoising: Optimality and inherent bounds,” in IEEE CVPR, 2011.
  • [43] N. M. Law, C. D. Mackay, and J. E. Baldwin, “Lucky imaging: high angular resolution imaging in the visible from the ground,” Astronomy & Astrophysics, vol. 446, no. 2, pp. 739–745, 2006.
  • [44] Y. Matsushita, E. Ofek, W. Ge, X. Tang, and H.-Y. Shum, “Full-frame video stabilization with motion inpainting,” IEEE T-PAMI, vol. 28, no. 7, pp. 1150–1163, 2006.
  • [45] S. Cho and S. Lee, “Fast motion deblurring,” ACM Transactions on Graphics (TOG), vol. 28, no. 5, 2009.
  • [46] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE T-IP, vol. 16, no. 8, pp. 2080–2095, Aug 2007.
  • [47] A. Buades, B. Coll, and J. Morel, “Image denoising methods. a new nonlocal principle,” SIAM Rev., vol. 52, no. 1, pp. 113–147, 2010.
  • [48] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz, “Adaptive deblocking filter,” IEEE TCSVT, vol. 13, no. 7, pp. 614–619, July 2003.
  • [49] H. C. Reeve and J. S. Lim, “Reduction of blocking effects in image coding,” Optical Engineering, vol. 23, 1984.
  • [50] A. Foi, V. Katkovnik, and K. Egiazarian, “Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images,” IEEE T-IP, vol. 16, no. 5, pp. 1395–1411, May 2007.
  • [51] C. Dong, Y. Deng, C. C. Loy, and X. Tang, “Compression artifacts reduction by a deep convolutional network,” in IEEE ICCV, 2015.
  • [52] S. Lin and H.-Y. Shum, “Separation of diffuse and specular reflection in color images,” in IEEE CVPR, 2001.
  • [53] Y. Shih, D. Krishnan, F. Durand, and W. T. Freeman, “Reflection removal using ghosting cues,” in IEEE CVPR, 2015.
  • [54] J. Yang, J. Wright, T. Huang, and Y. Ma, “Image super-resolution as sparse representation of raw image patches,” in IEEE CVPR, 2008.
  • [55] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, Nov 2010.
  • [56] W. T. Freeman, T. R. Jones, and E. C. Pasztor, “Example-based super-resolution,” IEEE CG&A, vol. 22, no. 2, pp. 56–65, 2002.
  • [57] N. Efrat, D. Glasner, A. Apartsin, B. Nadler, and A. Levin, “Accurate blur models vs. image priors in single image super-resolution,” in IEEE ICCV, 2013.
  • [58] G. Freedman and R. Fattal, “Image and video upscaling from local self-examples,” ACM Transactions on Graphics (TOG), vol. 30, no. 2, 2011.
  • [59] R. Timofte, V. De Smet, and L. Van Gool, “Anchored neighborhood regression for fast example-based super-resolution,” in IEEE ICCV, 2013.
  • [60] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in IEEE CVPR, 2016.
  • [61] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Deep laplacian pyramid networks for fast and accurate superresolution,” in IEEE CVPR, 2017.
  • [62] R. Fattal, “Single image dehazing,” in ACM SIGGRAPH 2008 Papers, ser. SIGGRAPH ’08.   New York, NY, USA: ACM, 2008, pp. 72:1–72:9.
  • [63] Y. Y. Schechner, S. G. Narasimhan, and S. K. Nayar, “Instant dehazing of images using polarization,” in IEEE CVPR, 2001.
  • [64] K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2341–2353, Dec 2011.
  • [65] R. T. Tan, “Visibility in bad weather from a single image,” in IEEE CVPR, 2008.
  • [66] J. P. Tarel and N. Hautiere, “Fast visibility restoration from a single color or gray level image,” in IEEE ICCV, 2009.
  • [67] X. Zhang, H. Li, Y. Qi, W. K. Leow, and T. K. Ng, “Rain removal in video by combining temporal and chromatic properties,” in IEEE International Conference on Multimedia and Expo, 2006.
  • [68] P. C. Barnum, S. Narasimhan, and T. Kanade, “Analysis of rain and snow in frequency space,” IJCV, vol. 86, no. 2, p. 256, Jan 2009.
  • [69] J. Liu, W. Yang, S. Yang, and Z. Guo, “Erase or fill? Deep joint recurrent rain removal and reconstruction in videos,” in IEEE CVPR, 2018.
  • [70] J. Chen, C. Tan, J. Hou, L. Chau, and H. Li, “Robust video content alignment and compensation for rain removal in a CNN framework,” CoRR, vol. abs/1803.10433, 2018.
  • [71] J. Guo and H. Chao, “One-to-many network for visually pleasing compression artifacts reduction,” in IEEE CVPR, 2017.
  • [72] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE T-IP, vol. 26, no. 7, pp. 3142–3155, July 2017.
  • [73] Y. Tai, J. Yang, X. Liu, and C. Xu, “Memnet: A persistent memory network for image restoration,” in IEEE ICCV, 2017.
  • [74] K. Yu, C. Dong, L. Lin, and C. C. Loy, “Crafting a toolchain for image restoration by deep reinforcement learning,” CoRR, vol. abs/1804.03312, 2018.
  • [75] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in IEEE CVPR, 2010.
  • [76] M. D. Zeiler, G. W. Taylor, and R. Fergus, “Adaptive deconvolutional networks for mid and high level feature learning,” in IEEE ICCV, 2011.
  • [77] M. Waleed Gondal, B. Schölkopf, and M. Hirsch, “The Unreasonable Effectiveness of Texture Transfer for Single Image Super-resolution,” ArXiv e-prints, Jul. 2018.
  • [78] K. Tahboub, D. Güera, A. R. Reibman, and E. J. Delp, “Quality-adaptive deep learning for pedestrian detection,” in IEEE ICIP, 2017.
  • [79] M. Hradiš, J. Kotera, P. Zemčík, and F. Šroubek, “Convolutional neural networks for direct text deblurring,” in BMVC, 2015.
  • [80] L. Xiao, J. Wang, W. Heidrich, and M. Hirsch, “Learning high-order filters for efficient blind deconvolution of document photographs,” in ECCV, 2016.
  • [81] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” CoRR, vol. abs/1603.08511, 2016. [Online]. Available: http://arxiv.org/abs/1603.08511
  • [82] V. P. Namboodiri, V. D. Smet, and L. V. Gool, “Systematic evaluation of super-resolution using classification,” in VCIP, 2011.
  • [83] V. Sharma, A. Diba, D. Neven, M. S. Brown, L. V. Gool, and R. Stiefelhagen, “Classification driven dynamic image enhancement,” CoRR, vol. abs/1710.07558, 2017.
  • [84] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, “AOD-Net: All-in-one dehazing network,” in IEEE ICCV, 2017.
  • [85] Y. Yao, B. R. Abidi, N. D. Kalka, N. A. Schmid, and M. A. Abidi, “Improving long range and high magnification face recognition: Database acquisition, evaluation, and enhancement,” CVIU, vol. 111, no. 2, pp. 111–125, 2008.
  • [86] M. Nishiyama, H. Takeshima, J. Shotton, T. Kozakaya, and O. Yamaguchi, “Facial deblur inference to improve recognition of blurred faces,” in IEEE CVPR, 2009.
  • [87] H. Zhang, J. Yang, Y. Zhang, N. M. Nasrabadi, and T. S. Huang, “Close the loop: Joint blind image restoration and recognition with sparse representation prior,” in IEEE ICCV, 2011.
  • [88] C. Fookes, F. Lin, V. Chandran, and S. Sridharan, “Evaluation of image resolution and super-resolution on face recognition performance,” Journal of Visual Communication and Image Representation, vol. 23, no. 1, pp. 75 – 93, 2012.
  • [89] F. W. Wheeler, X. Liu, and P. H. Tu, “Multi-frame super-resolution for face recognition,” in IEEE BTAS, 2007.
  • [90] J. Wu, S. Ding, W. Xu, and H. Chao, “Deep joint face hallucination and recognition,” CoRR, vol. abs/1611.08091, 2016.
  • [91] F. Lin, J. Cook, V. Chandran, and S. Sridharan, “Face recognition from super-resolved images,” in Proceedings of the Eighth International Symposium on Signal Processing and Its Applications, 2005.
  • [92] F. Lin, C. Fookes, V. Chandran, and S. Sridharan, “Super-resolved faces for improved face recognition from surveillance video,” Advances in Biometrics, 2007.
  • [93] P. H. Hennings-Yeomans, S. Baker, and B. V. Kumar, “Simultaneous super-resolution and feature extraction for recognition of low-resolution faces,” in IEEE CVPR, 2008.
  • [94] J. Yu, B. Bhanu, and N. Thakoor, “Face recognition in video with closed-loop super-resolution,” in IEEE CVPRW, 2011.
  • [95] H. Huang and H. He, “Super-resolution method for face recognition using nonlinear mappings on coherent features,” IEEE T-NN, vol. 22, no. 1, pp. 121–130, 2011.
  • [96] T. Uiboupin, P. Rasti, G. Anbarjafari, and H. Demirel, “Facial image super resolution using sparse representation for improving face recognition in surveillance monitoring,” in SIU, 2016.
  • [97] P. Rasti, T. Uiboupin, S. Escalera, and G. Anbarjafari, “Convolutional neural network super resolution for face recognition in surveillance monitoring,” in International Conference on Articulated Motion and Deformable Objects, 2016.
  • [98] X.-Y. Jing, X. Zhu, F. Wu, X. You, Q. Liu, D. Yue, R. Hu, and B. Xu, “Super-resolution person re-identification with semi-coupled low-rank discriminant dictionary learning,” in IEEE CVPR, 2015.
  • [99] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” arXiv preprint, 2018.
  • [100] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in IEEE CVPR, 2018.
  • [101] F. Kingdom and N. Prins, Psychophysics: a Practical Introduction.   Academic Press, 2016.
  • [102] M. J. C. Crump, J. V. McDonnell, and T. M. Gureckis, “Evaluating Amazon’s Mechanical Turk as a tool for experimental behavioral research,” in PloS one, 2013.
  • [103] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
  • [104] C. Fellbaum, WordNet: An Electronic Lexical Database.   Cambridge, MA: MIT Press, 1998.
  • [105] C. Vondrick, D. Patterson, and D. Ramanan, “Efficiently scaling up crowdsourced video annotation,” IJCV, vol. 101, no. 1, pp. 184–204, Jan. 2013.
  • [106] F. Heide, M. Steinberger, Y.-T. Tsai, M. Rouf, D. Pająk, D. Reddy, O. Gallo, J. Liu, W. Heidrich, K. Egiazarian, J. Kautz, and K. Pulli, “Flexisp: A flexible camera image processing framework,” ACM Trans. Graph., vol. 33, no. 6, pp. 231:1–231:13, Nov. 2014. [Online]. Available: http://doi.acm.org/10.1145/2661229.2661260
  • [107] C. J. Pellizzari, R. Trahan, H. Zhou, S. Williams, S. E. Williams, B. Nemati, M. Shao, and C. A. Bouman, “Optically coherent image formation and denoising using a plug and play inversion framework,” Applied optics, vol. 56, no. 16, pp. 4735–4744, 2017.
  • [108] J. R. Chang, C.-L. Li, B. Poczos, and B. V. Kumar, “One network to solve them all—solving linear inverse problems using deep projection models,” in 2017 IEEE International Conference on Computer Vision (ICCV).   IEEE, 2017, pp. 5889–5898.
  • [109] K. Zuiderveld, “Contrast limited adaptive histogram equalization,” Graphics Gems, pp. 474–485, 1994.
  • [110] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, “Deblurgan: Blind motion deblurring using conditional adversarial networks,” CoRR, vol. abs/1711.07064, 2017.
  • [111] W. Yang, J. Feng, J. Yang, F. Zhao, J. Liu, Z. Guo, and S. Yan, “Deep edge guided recurrent residual learning for image super-resolution,” IEEE T-IP, vol. 26, no. 12, pp. 5895–5907, Dec 2017.
  • [112] X. Mao, C. Shen, and Y. Yang, “Image restoration using convolutional auto-encoders with symmetric skip connections,” CoRR, vol. abs/1606.08921, 2016. [Online]. Available: http://arxiv.org/abs/1606.08921
  • [113] X.-J. Mao, C. Shen, and Y.-B. Yang, “Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections,” in NIPS, 2016.
  • [114] M. Gharbi, J. Chen, J. T. Barron, S. W. Hasinoff, and F. Durand, “Deep bilateral learning for real-time image enhancement,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, p. 118, 2017.
  • [115] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, “Deep image prior,” CoRR, vol. abs/1711.10925, 2017.
  • [116] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
  • [117] H. Chen, T. Liu, and T. Chang, “Tone reproduction: A perspective from luminance-driven perceptual grouping,” in IEEE CVPR, 2005.
  • [118] “Spacenet,” Apr 2018. [Online]. Available: https://spacenetchallenge.github.io/
  • [119] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE ICCV, 2017.
  • [120] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley, “Least squares generative adversarial networks,” in IEEE ICCV, 2017.
  • [121] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
  • [122] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” CoRR, vol. abs/1607.08022, 2016.
  • [123] R. Keys, “Cubic convolution interpolation for digital image processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 6, pp. 1153–1160, Dec 1981.
  • [124] J. Pan, D. Sun, H.-P. Pfister, and M.-H. Yang, “Blind image deblurring using dark channel prior,” in IEEE CVPR, 2016.
  • [125] S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” CoRR, vol. abs/1612.02177, 2016.
  • [126] D. Kundur and D. Hatzinakos, “Blind image deconvolution,” IEEE Signal Processing Magazine, vol. 13, no. 3, pp. 43–64, 1996.
  • [127] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, “Toward multimodal image-to-image translation,” in NIPS, 2017.