UG^2: a Video Benchmark for Assessing the Impact of Image Restoration and Enhancement on Automatic Visual Recognition

10/09/2017 ∙ by Rosaura G. Vidal, et al. ∙ University of Notre Dame ∙ University of Ljubljana

Advances in image restoration and enhancement techniques have led to discussion about how such algorithms can be applied as a pre-processing step to improve automatic visual recognition. In principle, techniques like deblurring and super-resolution should yield improvements by de-emphasizing noise and increasing signal in an input image. But the historically divergent goals of the computational photography and visual recognition communities have created a significant need for more work in this direction. To facilitate new research, we introduce a new benchmark dataset called UG^2, which contains three difficult real-world scenarios: uncontrolled videos taken by UAVs and manned gliders, as well as controlled videos taken on the ground. Over 160,000 annotated frames for hundreds of ImageNet classes are available, which are used for baseline experiments that assess the impact of known and unknown image artifacts and other conditions on common deep learning-based object classification approaches. Further, current image restoration and enhancement techniques are evaluated by determining whether or not they improve baseline classification performance. Results show that there is plenty of room for algorithmic innovation, making this dataset a useful tool going forward.




1 Introduction

Figure 1: (Top) In principle, image restoration and enhancement techniques should improve visual recognition performance by creating higher quality inputs for recognition models. This is the case when a Super Resolution Convolutional Neural Network [7] is applied to the image in this panel. (Bottom) In practice, we often see the opposite effect, especially when new artifacts are unintentionally introduced, as in this application of Deep Deblurring [43]. We describe a new video dataset (Sec. 3) for the study of problems with algorithm and data interplay (Sec. 6) like this one.

To build a visual recognition system for any application these days, one’s first inclination is to turn to the most recent machine learning breakthrough from the area of deep learning, which no doubt has been enabled by access to millions of training images from the Internet. But there are many circumstances where such an approach cannot be used as an off-the-shelf component to assemble the system we desire, because even the largest training dataset does not take into account all of the artifacts that can be experienced in the environment. As computer vision pushes further into real-world applications, what should a software system that can interpret images from sensors placed in any unrestricted setting actually look like?

First, it must incorporate a set of algorithms, drawn from the areas of computational photography and machine learning, into a processing pipeline that corrects and subsequently classifies images across time and space. Image restoration and enhancement algorithms that remove corruptions like blur, noise, and mis-focus, or manipulate images to gain resolution, change perspective, and compensate for lens distortion are now commonplace in photo editing tools. Such operations are necessary to improve the quality of raw images that are otherwise unacceptable for recognition purposes. But they must be compatible with the recognition process itself, and not adversely affect feature extraction or classification (Fig. 1).


Remarkably, little thought has been given to image restoration and enhancement algorithms for visual recognition — the goal of computational photography thus far has simply been to make images look appealing after correction [60, 3, 46, 17, 43]. It remains unknown what impact many transformations have on visual recognition algorithms. To begin to answer that question, exploratory work is needed to find out which image pre-processing algorithms, in combination with the strongest features and supervised machine learning approaches, are promising candidates for different problem domains.

One popular problem that contains imaging artifacts typically not found in computer vision datasets crawled from the web is the interpretation of images taken by aerial vehicles [47, 38]. This task is a key component of a number of applications including robotic navigation, scene reconstruction, scientific study, entertainment, and visual surveillance. Images captured from aerial vehicles tend to present a wide variety of artifacts and optical aberrations. These can be the product of weather and scene conditions (e.g., rain, light refraction, smoke, flare, glare, occlusion), movement (e.g., motion blur), or the recording equipment (e.g., video compression, sensor noise, lens distortion, mis-focus). How do we begin to address the need for image restoration and enhancement algorithms that are compatible with visual recognition in a scenario like this one?

In this paper, we propose the use of a benchmark dataset captured in realistic settings where image artifacts are common, as opposed to more typical imagery crawled from social media and web-based photo albums. Given the current popularity of aerial vehicles within computer vision, we suggest the use of data captured by UAVs and manned gliders as a relevant and timely challenge problem. But this can’t be the only data we consider, as we do not have knowledge of certain scene parameters that generated the artifacts of interest. Thus we suggest that ground-based video with controlled scene parameters and target objects is also essential. And finally, we need a set of protocols for evaluation that move away from a singular focus on perceived image quality to include classification performance. By combining all of these elements, we have created a new dataset dubbed UG^2 (UAV, Glider, and Ground), which consists of hundreds of videos and over 150,000 annotated frames spanning hundreds of ImageNet classes [39].

In summary, the contributions of this paper are:

  • A new video benchmark dataset representing both ideal conditions and common aerial image artifacts, which we make available to facilitate new research and to simplify the reproducibility of experimentation (footnote 1).

    Footnote 1: See the Supplemental Material for example videos from the dataset. The dataset can be accessed at:

  • An extensive evaluation of the influence of image aberrations and other problematic conditions on common object recognition models including VGG16 and VGG19 [42], InceptionV3 [44], and ResNet50 [14].

  • An analysis of the impact and suitability of basic and state-of-the-art image and video processing algorithms used in conjunction with common object recognition models. In this work, we look at deblurring [43, 33], image interpolation, and super-resolution [7, 21].

2 Related work

Datasets. The areas of image restoration and enhancement have a long history in computational photography, with associated benchmark datasets that are mainly used for the qualitative evaluation of image appearance. These include very small test image sets such as Set5 [3] and Set14 [60], and the set of blurred images introduced by Levin et al. [26]. Datasets containing more diverse scene content have been proposed including Urban100 [17] for enhancement comparisons and LIVE1 [41] for image quality assessment. While not originally designed for computational photography, the Berkeley Segmentation Dataset has been used by itself [17] and in combination with LIVE1 [52] for enhancement work. The popularity of deep learning methods has increased demand for training and testing data, which Su et al. provide as video content for deblurring work [43]. Importantly, none of these datasets were designed to combine image restoration and enhancement with recognition for a unified benchmark.

Most similar to the dataset we introduce in this paper are various large-scale video surveillance datasets, especially those which provide a “fixed” overhead view of urban scenes [10, 13, 40, 62]. However, these datasets are primarily meant for other research areas (e.g., event/action understanding, video summarization, face recognition) and are ill-suited for object recognition tasks, even if they share some common imaging artifacts that impair recognition as a whole. With respect to data collected by aerial vehicles, the VIRAT Video Dataset [35] contains “realistic, natural and challenging (in terms of its resolution, background clutter, diversity in scenes)” imagery for event recognition. Other datasets including aerial imagery are the UCF Aerial Action Data Set [1], UCF-ARG [2], UAV123 [32], and the multi-purpose dataset introduced by Yao et al. [55]. As with the computational photography datasets, none of these sets have specific protocols for image restoration and enhancement coupled with object recognition.

Table 1: Summary of the UG^2 dataset (see Supp. Tables 3 and 4 for a detailed breakdown of these conditions).

Restoration and Enhancement for Recognition. In this paper we consider the image restoration technique of deblurring, where the objective is to recover a sharp version of a blurry image without knowledge of the blur parameters. When considering motion, the original sharp image $x$ is convolved with a blur kernel $k$ to give the observed blurry image $y$: $y = x \ast k$ [27]. Accordingly, the sharp image can be recovered through deconvolution [24, 26, 19, 25] (see [51] for a comprehensive list of deconvolution techniques for deblurring) or methods that use multi-image aggregation and fusion [23, 31, 5]. Intuitively, if an image has been corrupted by blur, then deblurring should improve performance of recognizing objects in the image. An early attempt at unifying a high-level task like object recognition with a low-level task like deblurring was the Deconvolutional Network [58, 59]. Additional work has been undertaken in face recognition [56, 34, 61]. In this work, we look at deep learning-based deblurring techniques [33, 43] and a basic blind deconvolution method [22].
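As a rough illustration of the motion blur model described above, the following sketch blurs an image by convolving every row with a short horizontal kernel. The function name `convolve_rows`, the zero padding at the borders, and the three-tap box kernel are assumptions made for this example; they are not taken from the paper.

```python
def convolve_rows(image, kernel):
    """Convolve each row of a 2-D image (list of lists) with a 1-D kernel.

    Zero padding is used so the output has the same size as the input.
    """
    half = len(kernel) // 2
    blurred = []
    for row in image:
        out_row = []
        for i in range(len(row)):
            acc = 0.0
            for j, weight in enumerate(kernel):
                src = i + half - j  # convolution flips the kernel
                if 0 <= src < len(row):
                    acc += weight * row[src]
            out_row.append(acc)
        blurred.append(out_row)
    return blurred


# Hypothetical example: one bright pixel smeared by horizontal motion.
sharp = [[0.0, 0.0, 9.0, 0.0, 0.0]]
motion_kernel = [1 / 3, 1 / 3, 1 / 3]  # assumed 3-pixel horizontal box blur
blurry = convolve_rows(sharp, motion_kernel)
```

The energy of the single bright pixel is spread over its neighbours, which is the effect a deblurring method tries to invert, usually without knowing the kernel.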

With respect to enhancement, we focus on the specific technique of single image super-resolution, where an attempt is made at estimating a high-resolution image $x$ from a single low-resolution image $y$. The relationship between these images can be modeled as a linear transformation $y = Ax + n$, where $A$ is a matrix that encodes the processes of blurring and downsampling, and $n$ is a noise term [8]. A number of super-resolution techniques exist including sparse representation [53, 54], Nearest Neighbor approaches [12], image priors [8], local self examples [11], neighborhood embedding [45], and deep learning [7, 21]. Super-resolution can potentially help in object recognition by amplifying the signal of the target object to be recognized. Thus far, such a strategy has been limited to research in face recognition [28, 29, 15, 57, 16, 49, 37] and person re-identification [18] algorithms for video surveillance data. Here we look at simple interpolation methods [20] and deep learning-based super-resolution [7, 21].
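To make the linear degradation model concrete, here is a minimal one-dimensional sketch in which a matrix that averages neighbouring samples (blur) and keeps one value per group (downsampling) is applied to a high-resolution signal, with optional Gaussian noise added. The helper names `build_degradation_matrix` and `degrade`, the box-average blur, and the factor of 2 are illustrative assumptions, not details from the paper.

```python
import random


def build_degradation_matrix(n, factor):
    """Return a matrix (n // factor rows, n columns) that box-blurs and downsamples."""
    rows = n // factor
    return [
        [1.0 / factor if r * factor <= c < (r + 1) * factor else 0.0
         for c in range(n)]
        for r in range(rows)
    ]


def degrade(signal, matrix, noise_std=0.0, seed=0):
    """Apply the matrix to the signal and add zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [
        sum(a * v for a, v in zip(row, signal)) + rng.gauss(0.0, noise_std)
        for row in matrix
    ]


high_res = [2.0, 4.0, 6.0, 8.0]
A = build_degradation_matrix(len(high_res), factor=2)
low_res = degrade(high_res, A)  # noise_std is 0.0, so no noise is added here
```

Super-resolution attempts the reverse direction: recovering `high_res` from `low_res`, which is ill-posed because the matrix has fewer rows than columns.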

3 The UG^2 Dataset

UG^2 is composed of videos with frames, representing ImageNet [39] classes extracted from annotated frames from three different video collections (see Supp. Table 2 for the complete list of classes). These classes are further categorized into 37 super-classes encompassing visually similar ImageNet categories and two additional classes for pedestrian and resolution chart images (this distribution is explained in detail below). The three different video collections consist of: (1) 50 Creative Commons tagged videos taken by fixed-wing unmanned aerial vehicles (UAV) obtained from YouTube; (2) videos recorded by pilots of fixed-wing gliders; and (3) controlled videos captured on the ground specifically for this dataset. Table 1 presents a summary of the dataset and Fig. 2 presents example frames from each of the collections.

Figure 2: Examples of images in the three UG^2 collections: (a) UAV Collection, (b) Glider Collection, (c) Controlled Ground Collection.

Furthermore, the dataset contains a subset of object-level annotated images. Bounding boxes establishing object regions were manually determined using the Vatic tool for video annotation [50]. Each annotation in the dataset indicates the position, scale, visibility and super-class for an object in a video. This is useful for running classification experiments. For example, for the baseline experiments described in Sec. 6, the objects were cropped out from the frames in a square region of at least pixels (a common input size for many deep learning-based recognition models), using the annotations as a guide. Videos are also tagged to indicate problematic conditions.
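The cropping step described above can be sketched as follows: given an annotated bounding box, compute a square region centred on the object whose side is never smaller than a chosen minimum. The function name `square_crop_region`, the `(x, y, width, height)` box layout, and the minimum side of 224 pixels used in the example are assumptions for illustration only.

```python
def square_crop_region(box, frame_size, min_side):
    """Return (left, top, side) of a square crop centred on a bounding box.

    box is (x, y, width, height) in pixels; frame_size is (width, height).
    The side is at least min_side, but never larger than the frame.
    """
    x, y, w, h = box
    frame_w, frame_h = frame_size
    side = min(max(w, h, min_side), frame_w, frame_h)
    centre_x, centre_y = x + w / 2, y + h / 2
    left = int(round(centre_x - side / 2))
    top = int(round(centre_y - side / 2))
    # Shift the square back inside the frame if it spills over an edge.
    left = min(max(left, 0), frame_w - side)
    top = min(max(top, 0), frame_h - side)
    return left, top, side


# Hypothetical annotation in a 640x480 frame; 224 is an assumed minimum side.
region = square_crop_region((100, 100, 50, 30), (640, 480), min_side=224)
```

Centring the square on the box keeps some surrounding context around small objects, which is one plausible reading of using the annotations "as a guide".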

UAV Video Collection. This collection found within UG^2 consists of video recorded from small UAVs in both rural and urban areas. The videos in this collection are open source content tagged with a Creative Commons license, obtained from the YouTube video sharing site. Because of the source, they have different video resolutions (from to ) and frame rates (from FPS to FPS). This collection contains approximately hours of aerial video distributed across different videos.

For this collection we observed 8 different video artifacts and other problems: glare/lens flare, poor image quality, occlusion, over/under exposure, camera shaking and noise (present in some videos that use autopilot telemetry), motion blur, and fish eye lens distortion. Additionally this collection contains videos with problematic weather/scene conditions such as night/low light video, fog, cloudy conditions and occlusion due to snowfall. Overall it contains frames. Across a subset of these frames we observed different super-classes (including the non-ImageNet pedestrians class), from which we extracted object images. The cropped object images have a diverse range of sizes, from to .

Figure 3: Distribution of annotated images belonging to classes shared by at least two different UG^2 collections.

Glider Video Collection. This collection found within UG^2 consists of video recorded by licensed pilots of fixed-wing gliders in both rural and urban areas. It contains approximately hours of aerial video, distributed across different videos. The videos have frame rates ranging from FPS to FPS and different video formats such as MTS, MP4 and MOV. Given the nature of this collection, the videos mostly present imagery taken from thousands of feet above ground, further increasing the difficulty of object recognition tasks. Additionally, scenes of take off and landing contain artifacts such as motion blur, camera shaking, and occlusion (which in some cases is pervasive throughout the videos, showcasing parts of the glider that partially occlude the objects of interest).

For the Glider Collection we observed different video artifacts and other problems: glare/lens flare, over/under exposure, camera shaking and noise, occlusion, motion blur, and fish eye lens distortion. Furthermore, this collection contains videos with problematic weather/scene conditions such as fog, clouds and occlusion due to rain. Overall this collection contains frames. Across the annotated frames we observed different classes (including the non-ImageNet class of pedestrians), from which we extracted object images. The cropped object images have a diverse range of sizes, from to .

Ground Video Collection. In order to provide some ground-truth with respect to problematic image conditions, we performed a video collection on the ground that intentionally induced several common artifacts. One of the main challenges for object recognition within aerial images is the difference in the scale of certain objects compared to those in the images used to train the recognition model. To address this, we recorded video of static objects (e.g., flower pots, buildings) at a wide range of distances (ft, ft, ft, ft, ft, ft, ft, and ft).

In conjunction with the differing recording distances, we induced motion blur in images using an orbital shaker to generate horizontal movement at different rotations per minute (rpm, rpm, rpm, and rpm). Parallel to this, we recorded video under different weather conditions (sun, clouds, rain, snow) that could affect object recognition, and employed a Sony Bloggie hand-held camera (with resolution and a frame rate of 60 FPS) and a GoPro Hero 4 (with resolution and a frame rate of 30 FPS), whose fisheye lens introduced further distortion.

The Ground Collection contains approximately minutes of video, distributed across videos. Overall this collection represents annotated frames, and unannotated frames distributed across specific videos. The annotated frames contain different ImageNet classes. Furthermore, the collection includes an additional class of videos showcasing a inch checkerboard grid recorded at all of the aforementioned distances and rotation speeds. The motivation for including this artificial class is to provide a reference with well-defined straight lines to assess the visual impact of image restoration and enhancement algorithms. The cropped object images have a diverse range of sizes: from to .

Object Categories and Distribution of Data. A challenge presented by the objects annotated in the UAV and Glider collections is the high variability of both object scale and rotation. These two factors make it difficult to differentiate some of the more fine-grained ImageNet categories. For example, while it may be easy to recognize a car from an aerial picture taken from hundreds (if not thousands) of feet above the ground, it might be impossible to determine whether that car is a taxi, a jeep or a sports car. An exception to this rule is the Ground Collection where we had more control over distances from the target which made possible the fine-grained class distinction. For example, chainlink-fence and bannisters are separate classes in the Ground Collection and are not combined to form a fence super-class.

Thus UG^2 organizes the objects in high level classes that encompass multiple ImageNet synsets (ImageNet provides images for “synsets” or “synonym sets” of words or phrases that describe a concept in WordNet [9]; for more detail on the relationship between UG^2 and ImageNet classes see Supp. Table 1). Over 70% of the UG^2 classes have more than 400 images and 58% of the classes are present in the imagery of at least two collections (Fig. 3). Around 20% of the classes are present in all three collections.

4 The UG^2 Classification Protocols

In order to establish good baselines for classification performance before and after the application of image enhancement and restoration algorithms, we used a selection of common deep learning approaches to recognize annotated objects and then considered the correct classification rate. Namely, we used the Keras [6] versions of the pre-trained networks VGG16 & VGG19 [42], Inception V3 [44], and ResNet50 [14]. These experiments also serve as a demonstration of the UG^2 classification protocols. Each candidate restoration or enhancement algorithm should be treated as an image pre-processing step to prepare data to be submitted to all four networks, which serve as canonical classification references. The entirety of the annotated data for all three collections is used for evaluation, with the exceptions of the pedestrian and resolution chart classes, which do not belong to any synsets recognized by the networks. For our current experiments, we restricted our analysis of the UG^2 dataset to pre-trained networks. Re-training the networks on our dataset is left for future work. With respect to restoration and enhancement approaches that must be trained, we suggest a cross-dataset protocol [48] where some annotated training data should come from outside UG^2. However, un-annotated videos for additional validation purposes and parameter tuning are provided.
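The protocol above treats a candidate algorithm as a pre-processing step whose output is passed to every reference network. A minimal harness for that flow might look like the sketch below. The function `run_protocol`, the callable interfaces, the toy network, and the synset names are all hypothetical stand-ins; in practice each entry in `networks` would wrap one of the pre-trained models.

```python
def run_protocol(enhance, networks, images, top_k=5):
    """Apply one enhancement step, then query every reference network.

    enhance: callable mapping an image to a processed image.
    networks: dict of name -> callable returning (synset, probability) pairs.
    Returns a dict of name -> list of top-k synset lists, one per image.
    """
    processed = [enhance(image) for image in images]
    results = {}
    for name, predict in networks.items():
        per_image = []
        for image in processed:
            ranked = sorted(predict(image), key=lambda pair: pair[1], reverse=True)
            per_image.append([synset for synset, _ in ranked[:top_k]])
        results[name] = per_image
    return results


# Stand-ins for illustration only (not real models or real synset IDs).
def identity(image):
    return image


def toy_network(image):
    return [("taxi", 0.3), ("car", 0.6), ("tree", 0.1)]


predictions = run_protocol(identity, {"toy": toy_network}, images=["frame_0"])
```

Passing `identity` as the enhancement step gives the unprocessed baseline, so the same harness covers both the "before" and "after" measurements.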

Classification Metrics. The networks used for the UG^2 classification task return a list of the ImageNet synsets along with the probability of the object belonging to each of the synset classes. However, given what we discussed in Sec. 3, in some cases it is impossible to provide a fine-grained labeling for the annotated objects. Consequently, most of the super-classes we defined for UG^2 are composed of more than one ImageNet synset. That is, each annotated image $i$ has a single super-class label $L_i$, which in turn is defined by a set of ImageNet synsets $L_i = \{s_1, \ldots, s_n\}$.

To measure accuracy, we observe the number of correctly identified synsets in the top 5 predictions made by each pre-trained network. A prediction is considered to be correct if its synset belongs to the set of synsets in the ground-truth super-class label. We use two metrics for this. The first measures the rate of detection of at least 1 correctly classified synset class. In other words, for a super-class label $L_i = \{s_1, s_2, s_3\}$, a network is able to detect 1 or more correctly classified synsets in the top 5 predictions. The second measures the rate of detecting all the possible correct synset classes in the super-class label synset set. For example, for a super-class label $L_i = \{s_1, s_2, s_3\}$, a network is able to detect 3 correct synsets in the top 5 labels.
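The two metrics can be sketched in a few lines. This is one reading of the definitions above, under the assumption that the second metric requires every synset of the super-class to appear among the top 5 predictions; the text does not say how super-classes with more than five synsets are handled, and this sketch would never count them as fully detected. Function names and the toy synset labels are hypothetical.

```python
def at_least_one_correct(top5, superclass_synsets):
    """True if any top-5 prediction belongs to the ground-truth super-class."""
    return any(synset in superclass_synsets for synset in top5)


def all_correct(top5, superclass_synsets):
    """True if every synset of the super-class appears in the top-5 list."""
    return set(superclass_synsets) <= set(top5)


def classification_rates(predictions, labels):
    """Return (rate of at-least-one, rate of all) over a set of images.

    predictions: list of top-5 synset lists, one per image.
    labels: list of synset sets, one ground-truth super-class per image.
    """
    total = len(predictions)
    hits_any = sum(at_least_one_correct(p, l) for p, l in zip(predictions, labels))
    hits_all = sum(all_correct(p, l) for p, l in zip(predictions, labels))
    return hits_any / total, hits_all / total


# Toy data with made-up synset names.
preds = [["s1", "s2", "x", "y", "z"], ["q", "r", "s", "t", "u"]]
truth = [{"s1", "s2"}, {"s1"}]
rate_any, rate_all = classification_rates(preds, truth)
```

By construction the second rate can never exceed the first, which matches the gap between the two metrics reported in the results.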

5 Baseline Enhancement and Restoration

To shed light on the effects image restoration and enhancement algorithms have on classification, we tested classic and state-of-the-art algorithms for image interpolation [20], super-resolution [7, 21], and deblurring [33, 43, 36] (see Supp. Fig. 1 for examples).

Interpolation methods. These classic methods attempt to obtain a high resolution image by up-sampling the source image (usually assuming the source image is a down-sampled version of the high resolution one) and by providing the best approximation of a pixel’s color and intensity values depending on the nearby pixels. Since they do not need any prior training, they can be directly applied to any image. Nearest neighbor interpolation assigns each output pixel the value of the single closest source pixel. Bilinear interpolation instead uses a weighted average of the two nearest source pixels along each axis (a 2×2 neighborhood), and bicubic interpolation increases this to four along each axis (a 4×4 neighborhood).
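The difference between the first two interpolation schemes can be seen in a small sketch that enlarges a grayscale image stored as a list of rows. The function names and the pixel-centre alignment convention are assumptions of this example (libraries differ in how they align source and output grids), and bicubic interpolation is omitted for brevity.

```python
def upscale_nearest(img, factor):
    """Enlarge by copying the closest source pixel into each output pixel."""
    h, w = len(img), len(img[0])
    return [[img[r // factor][c // factor] for c in range(w * factor)]
            for r in range(h * factor)]


def upscale_bilinear(img, factor):
    """Enlarge by blending the two nearest source pixels along each axis."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(h * factor):
        # Map the output pixel centre back to source coordinates and clamp.
        y = min(max((r + 0.5) / factor - 0.5, 0.0), h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for c in range(w * factor):
            x = min(max((c + 0.5) / factor - 0.5, 0.0), w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


tiny = [[0.0, 10.0]]
nearest = upscale_nearest(tiny, 2)
bilinear = upscale_bilinear(tiny, 2)
```

Nearest neighbor keeps the hard step between the two source values, while bilinear interpolation produces a ramp between them. Neither adds information that was not already in the source image.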

SRCNN. The Super-Resolution Convolutional Neural Network (SRCNN) [7] introduced deep learning techniques to super-resolution. The method employs a feedforward deep CNN to learn an end-to-end mapping between low resolution and high resolution images. The network was trained on 5 million “sub-images” generated from 395,909 images of the ILSVRC 2013 ImageNet detection training partition [39]. Typically, the results obtained from SRCNN can be distinguished from their low resolution counterparts by their sharper edges without visible artifacts.

VDSR. The Very Deep Super Resolution (VDSR) algorithm [21] aims to outperform SRCNN by employing a deeper CNN inspired by the VGG architecture [42]. It also decreases training iterations and time by employing residual learning with a very high learning rate for faster convergence. The VDSR network was trained on 291 images, collectively taken from Yang et al. [54] and the Berkeley Segmentation Dataset [30]. Unlike SRCNN, the network is capable of handling different scale factors. A good image processed by VDSR is characterized by well-defined contours and a lack of edge effects at the borders.

Basic Blind Deconvolution. The goal of any deblurring algorithm is to attempt to remove blur artifacts (e.g., the products of motion or depth variation, either from the object or the camera) that degrade image quality. This can be as simple as employing Matlab’s blind deconvolution algorithm [22], which deconvolves the image using the maximum likelihood algorithm, with a array of 1s as the initial point spread function.

Deep Video Deblurring. The Deep Video Deblurring algorithm [43] was designed to address camera shake blur. However, in the results presented by Su et al., the algorithm also obtained good results for other types of blur, such as motion blur. This algorithm employs a CNN that was trained with video frames containing synthesized motion blur such that it receives a stack of neighboring frames and returns a deblurred frame. The algorithm allows for three types of frame-to-frame alignment: no alignment, optical flow alignment, and homography alignment. For our experiments we used optical flow alignment, which was reported to have the best performance with this algorithm.

Deep Dynamic Scene Deblurring. Similarly, the Deep Dynamic Scene Deblurring algorithm [33] utilizes deep learning in order to remove motion blur. Nah et al. implement a multi-scale CNN to restore blurred images in an end-to-end manner without assuming or estimating a blur kernel model. The network was trained using blurry images generated by averaging sequences (by considering gamma correction) of sharp frames in a dynamic scene with high speed cameras. Given that this algorithm was computationally expensive, we directly applied it to the cropped object regions, rather than to the full video frame.
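The idea of generating blurry training images by averaging sharp frames "by considering gamma correction" can be sketched as follows: frames are converted to linear light, averaged, and converted back. The power-law response with an exponent of 2.2 and the function name `synthesize_blur` are assumptions of this sketch; the text above does not state the exact response curve used.

```python
GAMMA = 2.2  # assumed camera response exponent; not stated in the text above


def synthesize_blur(frames, gamma=GAMMA):
    """Average sharp frames in linear light, then re-apply gamma.

    frames: list of equally sized frames, each a flat list of values in [0, 1].
    """
    count = len(frames)
    blurred = []
    for pixels in zip(*frames):
        linear_mean = sum(p ** gamma for p in pixels) / count
        blurred.append(linear_mean ** (1.0 / gamma))
    return blurred


# Two toy one-row "frames": the first pixel flips from dark to bright,
# the second pixel stays constant.
frames = [[0.0, 0.5], [1.0, 0.5]]
blurry_frame = synthesize_blur(frames)
```

Averaging in linear light yields a brighter result for the changing pixel than a plain average of the stored values would, while pixels that do not change between frames are left untouched.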

6 UG^2 Baseline Results and Analysis

Original Classification Results. Fig. 4 depicts the baseline classification results for the UG^2 collections, without any pre-processing, at rank 5 (results for top 1 predictions can be found in Supp. Figs. 2-4). Overall we observed distinct differences between the results for all three collections, particularly between the airborne collections (UAV and Glider collections) and the Ground Collection. These results establish that common deep learning networks alone cannot achieve good classification rates for this dataset.

Figure 4: Classification rates at rank 5 for the original, unprocessed, frames for each collection in the dataset.
Figure 5: Comparison of classification rates at rank 5 for each collection after applying resolution enhancement.

Given the very poor quality of its videos, the UAV Collection turned out to be the most challenging in terms of object classification, obtaining the lowest classification performance out of the three collections. While the Glider Collection shared similar problematic conditions with the UAV Collection, we found that the images in this collection had a slightly higher classification rate than those in the UAV Collection in terms of identifying at least one correctly classified synset class. This improvement might be caused by the limited degree of movement of the gliders, since it ensured that the movement between frames was kept more stable over time, and by the camera’s recording quality. The controlled Ground Collection yielded the highest classification rates, which, in an absolute sense, are still low.

Figure 6: Comparison of classification rates at rank 5 for each collection after applying deblurring.

Effect of Restoration and Enhancement. Ideally, image restoration and enhancement algorithms should help object recognition by improving poor quality images and should not impair it for good quality images. To test this assumption for the algorithms described in Sec. 5, we used them to pre-process the annotated video frames of UG^2 and then proceeded to re-crop the objects of interest using the annotated bounding box information (as described in Sec. 3). Given that the scale of the images enhanced with the interpolation algorithms was doubled, the bounding boxes were scaled accordingly in those cases. Furthermore, the cropped object images were re-sized to 224×224 pixels (the input size for VGG16, VGG19 and ResNet50) and 299×299 pixels (the input size for Inception V3) during the classification experiments. See Supp. Tables 5-8, 9-13, and 13-16 for detailed breakdowns of the results for what follows.
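The bookkeeping in this step is simple but easy to get wrong, so a short sketch may help: when an enhancement doubles the image size, the annotated box must be scaled by the same factor before re-cropping, and each crop is then resized to the input size of the network that will classify it. The helper `scale_box` and the `(x, y, width, height)` box layout are assumptions; the sizes in `INPUT_SIZES` are the default input shapes of the Keras versions of these architectures.

```python
# Default input sizes (width, height) of the Keras reference networks.
INPUT_SIZES = {
    "VGG16": (224, 224),
    "VGG19": (224, 224),
    "ResNet50": (224, 224),
    "InceptionV3": (299, 299),
}


def scale_box(box, factor):
    """Scale an (x, y, width, height) box to match an image enlarged by factor."""
    return tuple(int(round(value * factor)) for value in box)


# Hypothetical annotation on the original frame, then on a 2x enlarged frame.
original_box = (10, 20, 50, 60)
enlarged_box = scale_box(original_box, 2)
target_size = INPUT_SIZES["InceptionV3"]
```

Scaling the box rather than re-annotating keeps the cropped region identical in content before and after enhancement, so any change in classification rate can be attributed to the enhancement itself.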

Figure 7: Comparison of classification rates for different weather conditions at rank 5 for the Ground Collection: (a) impact of different weather conditions on the baseline Ground Collection; (b) impact of resolution enhancement techniques; (c) impact of deblurring techniques. To simplify the analysis, each point represents the performance when considering the output of all four networks simultaneously.

As can be observed in Figs. 5 and 6, the behaviour of the resolution enhancement and deblurring algorithms is different between the airborne and Ground collections. For the most part, both types of algorithms tended to improve the rate of identification for at least one correct class for all of the networks for the UAV and Glider collections (Figs. 5(a), 5(b), 6(a), and 6(b)). Over 60% of the experiments reported an increase in the correct classification rate compared to that of the baseline. Conversely, for the Ground Collection, the restoration and enhancement algorithms seemed to impair the classification for all networks (Figs. 5(c) and 6(c)), going as far as reducing the at least one class identification performance by more than 16% for some experiments. More than 60% of the experiments reported a decrease in the classification rate for the Ground Collection. The property of hurting recognition performance on good quality imagery is certainly undesirable in these cases.

Further along these lines, while the classification rate for at least one correct class was increased for the airborne collections, after employing enhancement techniques the classification rate for finding all possible sub-classes in the super-class was negatively impacted for all three collections. Between 53% and 68% of the experiments reported a decrease in this metric. But this behaviour seemed to be more prevalent for the deblurring algorithms. For the UAV and Glider collections, 75% and 92% of the deblurring experiments respectively had a negative impact on the classification rate for finding all possible classes, while only 40% and 45% of the resolution enhancement experiments reported a negative impact for the same metric.

We can also consider the performance with respect to individual networks. For the UAV (Fig. 5(a)) and Glider (Fig. 5(b)) Collections, SRCNN provided the best results for the VGG16 and VGG19 networks in both metrics, and was also the best in improving the rate of finding all the correct synsets of their respective super-class. Nevertheless, the best overall improvement for the rate of correctly classifying at least one class in both collections was achieved by employing the Deep Dynamic Scene Deblurring algorithm, with an improvement of 8.96% for the Inception network in the UAV case, and 6.5% for the Glider case. Resolution enhancement algorithms dominated the classification rate improvement for the Ground Collection, where VDSR obtained the highest improvement in both metrics for the VGG16, VGG19 (the best with 3.56% improvement), and Inception networks, while Bilinear interpolation achieved the highest improvement for the ResNet50 network.

In contrast to the other algorithms we tested, Blind Deconvolution drove down performance for almost all networks. For the UAV Collection, Blind Deconvolution led to a decrease of up to 6.07% in the rate of classifying at least one class correctly, observed for the ResNet50 network. This behaviour was also observed for the Glider and Ground collections, where it led to the highest decreases in the classification rate of both metrics for all networks: 7% for the ResNet50 network on the Glider Collection and 15.06% for the VGG16 network on the Ground Collection.

Effect of Weather Conditions. A significant contribution of our dataset is the availability of ground-truth for weather conditions in the Ground Collection. Without any pre-processing applied to that set, the classification performance under different weather conditions varies widely (Fig. 7(a)). In particular, there was a stark contrast between the classification rates of video taken during rainy weather and video taken under any other condition, with rain causing the classification rate for both metrics to drop. Likewise, snowy weather presented a lower classification rate than cloudy or sunny weather as it shares some of the problems of rainy video capture: adverse lighting conditions and partial object occlusion from the precipitation. Cloudy weather proved to be the most beneficial for image capture as those videos lacked most of the problems of the other conditions. Sunny conditions are not the best because of glare. This study confirms previously reported results for the impact of weather on recognition [4].

We also analyzed the interplay between the different restoration and enhancement algorithms and different weather conditions (Figs. 6(b) and 6(c); see Supp. Tables 17-20 for detailed results). For this analysis we observed that resolution enhancement algorithms provided small benefits for both metrics. 50% of the experiments improved the correct classification rate of at least one class, and 40.63% improved the other metric. Again, resolution enhancement algorithms tended to provide the most improvement. The highest improvement (3.36% for the correct classification rate of at least one class) was achieved for sunny weather by the VDSR algorithm. Note that while classification for the more problematic weather conditions (rain, snow and sun) was improved, this was not the case for cloudy weather, where the original images were already of high quality.
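The summary figures in this paragraph (the share of experiments that improved a metric, and the best algorithm for a given weather condition) can be derived from a table of per-experiment changes. The sketch below assumes such a table is available as a mapping from a (weather, algorithm) pair to the change in classification rate; the helper names, the placeholder algorithm names and the numbers are all hypothetical and are not taken from the supplementary tables.

```python
from typing import Dict, Tuple

Experiment = Tuple[str, str]  # (weather condition, algorithm name)


def share_improved(deltas: Dict[Experiment, float]) -> float:
    """Percentage of experiments whose change in rate is strictly positive."""
    if not deltas:
        raise ValueError("no experiments given")
    improved = sum(1 for delta in deltas.values() if delta > 0)
    return 100.0 * improved / len(deltas)


def best_per_weather(
    deltas: Dict[Experiment, float],
) -> Dict[str, Tuple[str, float]]:
    """For each weather condition, the algorithm with the largest change."""
    best: Dict[str, Tuple[str, float]] = {}
    for (weather, algorithm), delta in deltas.items():
        if weather not in best or delta > best[weather][1]:
            best[weather] = (algorithm, delta)
    return best


# Made-up changes in percentage points, for illustration only.
toy_deltas = {
    ("sunny", "algo_a"): 1.0,
    ("sunny", "algo_b"): -2.0,
    ("cloudy", "algo_a"): -0.5,
    ("cloudy", "algo_b"): -3.0,
}

toy_share = share_improved(toy_deltas)
toy_best = best_per_weather(toy_deltas)
```

In this toy table no algorithm helps under the "cloudy" condition, so the best entry for it is still negative, which mirrors the observation that already high-quality frames do not benefit from pre-processing.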

7 Discussion

The results of our experiments led to some surprises. While the restoration and enhancement algorithms tended to improve the classification results for the diverse imagery included in our dataset, no approach was able to improve the results by a significant margin. Moreover, in some cases, performance degraded after image pre-processing, particularly for higher quality frames, making these kinds of pre-processing techniques unviable for heterogeneous datasets. We also noticed that different algorithms for the same type of image processing can have very different effects, as can different combinations of pre-processing and recognition algorithms. Depending on the metric considered, performance could be better or worse for various techniques. A possible reason for this is that most of these networks were trained with images having a single type of image distortion and hence fail for images with multiple distortions from heterogeneous sources. Significant improvement may be achievable if the networks are re-trained with UG^2; however, this needs further investigation and is left for future work. Thus, the UG^2 dataset will prove to be useful for studying these phenomena for quite some time to come.

UG^2 forms the core of a large prize challenge that will be announced in Fall 2017 and run from Spring to early Summer 2018. In this paper, we described one protocol that is part of that challenge. Several alternate protocols that are useful for research in this direction will also be included. For instance, we did not look at networks that combine feature learning, image enhancement and restoration, and classification. A protocol supporting this will be available. UG^2 can also be used for more traditional computational photography assessment (i.e., making the images look better), and this too will be supported. Stay tuned for more.

Acknowledgement. Funding was provided under IARPA contract #2016-16070500002. This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. We thank Dr. Adam Czajka, visiting assistant professor at the University of Notre Dame, and Mr. Sebastian Kawa for assistance with data collection.


  • [1] UCF Aerial Action data set.
  • [2] UCF-ARG data set.
  • [3] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In British Machine Vision Conference, 2012.
  • [4] T. E. Boult and W. J. Scheirer. Long range facial image acquisition and quality. In M. Tistarelli, S. Li, and R. Chellappa, editors, Handbook of Remote Biometrics: for Surveillance and Security. Springer-Verlag, August 2009.
  • [5] S. Cho and S. Lee. Fast motion deblurring. ACM Transactions on Graphics (TOG), 28(5), 2009.
  • [6] F. Chollet et al. Keras, 2015.
  • [7] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision (ECCV), 2014.
  • [8] N. Efrat, D. Glasner, A. Apartsin, B. Nadler, and A. Levin. Accurate blur models vs. image priors in single image super-resolution. In IEEE International Conference on Computer Vision (ICCV), 2013.
  • [9] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.
  • [10] R. B. Fisher. The PETS04 surveillance ground-truth data sets. In Proc. 6th IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2004.
  • [11] G. Freedman and R. Fattal. Image and video upscaling from local self-examples. ACM Transactions on Graphics (TOG), 30(2), 2011.
  • [12] W. T. Freeman, T. R. Jones, and E. C. Pasztor. Example-based super-resolution. IEEE Computer Graphics and Applications, 22(2):56–65, 2002.
  • [13] M. Grgic, K. Delac, and S. Grgic. SCface–surveillance cameras face database. Multimedia Tools and Applications, 51(3):863–879, 2011.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
  • [15] P. H. Hennings-Yeomans, S. Baker, and B. V. Kumar. Simultaneous super-resolution and feature extraction for recognition of low-resolution faces. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
  • [16] H. Huang and H. He. Super-resolution method for face recognition using nonlinear mappings on coherent features. IEEE Transactions on Neural Networks, 22(1):121–130, 2011.
  • [17] J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [18] X.-Y. Jing, X. Zhu, F. Wu, X. You, Q. Liu, D. Yue, R. Hu, and B. Xu. Super-resolution person re-identification with semi-coupled low-rank discriminant dictionary learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [19] N. Joshi, C. L. Zitnick, R. Szeliski, and D. J. Kriegman. Image deblurring and denoising using color priors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [20] R. Keys. Cubic convolution interpolation for digital image processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(6):1153–1160, Dec 1981.
  • [21] J. Kim, J. K. Lee, and K. M. Lee. Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [22] D. Kundur and D. Hatzinakos. Blind image deconvolution. IEEE Signal Processing Magazine, 13(3):43–64, 1996.
  • [23] N. M. Law, C. D. Mackay, and J. E. Baldwin. Lucky imaging: high angular resolution imaging in the visible from the ground. Astronomy & Astrophysics, 446(2):739–745, 2006.
  • [24] A. Levin, R. Fergus, F. Durand, and W. T. Freeman. Image and depth from a conventional camera with a coded aperture. ACM Transactions on Graphics (TOG), 26(3), 2007.
  • [25] A. Levin and B. Nadler. Natural image denoising: Optimality and inherent bounds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
  • [26] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman. Understanding and evaluating blind deconvolution algorithms. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [27] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman. Efficient marginal likelihood optimization in blind deconvolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
  • [28] F. Lin, J. Cook, V. Chandran, and S. Sridharan. Face recognition from super-resolved images. In Proceedings of the Eighth International Symposium on Signal Processing and Its Applications, volume 2, 2005.
  • [29] F. Lin, C. Fookes, V. Chandran, and S. Sridharan. Super-resolved faces for improved face recognition from surveillance video. Advances in Biometrics, pages 1–10, 2007.
  • [30] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In IEEE International Conference on Computer Vision (ICCV), 2001.
  • [31] Y. Matsushita, E. Ofek, W. Ge, X. Tang, and H.-Y. Shum. Full-frame video stabilization with motion inpainting. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 28(7):1150–1163, 2006.
  • [32] M. Mueller, N. Smith, and B. Ghanem. A benchmark and simulator for UAV tracking. In European Conference on Computer Vision (ECCV), 2016.
  • [33] S. Nah, T. H. Kim, and K. M. Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. CoRR, abs/1612.02177, 2016.
  • [34] M. Nishiyama, H. Takeshima, J. Shotton, T. Kozakaya, and O. Yamaguchi. Facial deblur inference to improve recognition of blurred faces. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [35] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C. C. Chen, J. T. Lee, S. Mukherjee, J. K. Aggarwal, H. Lee, L. Davis, E. Swears, X. Wang, Q. Ji, K. Reddy, M. Shah, C. Vondrick, H. Pirsiavash, D. Ramanan, J. Yuen, A. Torralba, B. Song, A. Fong, A. Roy-Chowdhury, and M. Desai. A large-scale benchmark dataset for event recognition in surveillance video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
  • [36] J. Pan, D. Sun, H.-P. Pfister, and M.-H. Yang. Blind image deblurring using dark channel prior. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [37] P. Rasti, T. Uiboupin, S. Escalera, and G. Anbarjafari. Convolutional neural network super resolution for face recognition in surveillance monitoring. In International Conference on Articulated Motion and Deformable Objects, 2016.
  • [38] V. Reilly, B. Solmaz, and M. Shah. Shadow casting out of plane (scoop) candidates for human and vehicle detection in aerial imagery. International Journal of Computer Vision (IJCV), 101(2):350–366, 2013.
  • [39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [40] J. Shao, C. C. Loy, and X. Wang. Scene-independent group profiling in crowd. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [41] H. R. Sheikh, M. F. Sabir, and A. C. Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing (T-IP), 15(11):3440–3451, 2006.
  • [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [43] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang. Deep video deblurring. CoRR, abs/1611.08387, 2016.
  • [44] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
  • [45] R. Timofte, V. De Smet, and L. Van Gool. Anchored neighborhood regression for fast example-based super-resolution. In IEEE International Conference on Computer Vision (ICCV), 2013.
  • [46] R. Timofte, V. De Smet, and L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision (ACCV), 2014.
  • [47] T. Tomic, K. Schmid, P. Lutz, A. Domel, M. Kassecker, E. Mair, I. L. Grixa, F. Ruess, M. Suppa, and D. Burschka. Toward a fully autonomous UAV: Research platform for indoor and outdoor urban search and rescue. IEEE Robotics & Automation Magazine, 19(3):46–56, 2012.
  • [48] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
  • [49] T. Uiboupin, P. Rasti, G. Anbarjafari, and H. Demirel. Facial image super resolution using sparse representation for improving face recognition in surveillance monitoring. In 24th Signal Processing and Communication Application Conference (SIU), 2016.
  • [50] C. Vondrick, D. Patterson, and D. Ramanan. Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision (IJCV), 101(1):184–204, Jan. 2013.
  • [51] R. Wang and D. Tao. Recent progress in image deblurring. arXiv preprint arXiv:1409.6838, 2014.
  • [52] C.-Y. Yang, C. Ma, and M.-H. Yang. Single-image super-resolution: A benchmark. In European Conference on Computer Vision (ECCV), 2014.
  • [53] J. Yang, J. Wright, T. Huang, and Y. Ma. Image super-resolution as sparse representation of raw image patches. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
  • [54] J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing (T-IP), 19(11):2861–2873, 2010.
  • [55] B. Yao, X. Yang, and S.-C. Zhu. Introduction to a large-scale general purpose ground truth database: Methodology, annotation tool and benchmarks. In 6th International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, 2007.
  • [56] Y. Yao, B. R. Abidi, N. D. Kalka, N. A. Schmid, and M. A. Abidi. Improving long range and high magnification face recognition: Database acquisition, evaluation, and enhancement. Computer Vision and Image Understanding (CVIU), 111(2):111–125, 2008.
  • [57] J. Yu, B. Bhanu, and N. Thakoor. Face recognition in video with closed-loop super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2011.
  • [58] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
  • [59] M. D. Zeiler, G. W. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In IEEE International Conference on Computer Vision (ICCV), 2011.
  • [60] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, 2010.
  • [61] H. Zhang, J. Yang, Y. Zhang, N. M. Nasrabadi, and T. S. Huang. Close the loop: Joint blind image restoration and recognition with sparse representation prior. In IEEE International Conference on Computer Vision (ICCV), 2011.
  • [62] X. Zhu, C. C. Loy, and S. Gong. Video synopsis by heterogeneous multi-source correlation. In IEEE International Conference on Computer Vision (ICCV), 2013.