Assessing Image Quality Issues for Real-World Problems

by Tai-Yin Chiu, et al.

We introduce a new large-scale dataset that links the assessment of image quality issues to two practical vision tasks: image captioning and visual question answering. First, for 39,181 images taken by people who are blind, we identify whether each is of sufficient quality for its content to be recognized, as well as which of six quality flaws are observed. These labels serve as a critical foundation for the following contributions: (1) a new problem and algorithms for deciding whether an image is of insufficient quality to recognize the content and so not captionable, (2) a new problem and algorithms for deciding which of six quality flaws an image contains, (3) a new problem and algorithms for deciding whether a visual question is unanswerable due to unrecognizable content versus the content of interest being missing from the field of view, and (4) a novel application of more efficiently creating a large-scale image captioning dataset by automatically deciding whether an image is of insufficient quality and so should not be captioned. We publicly share our datasets and code to facilitate future extensions of this work.




1 Introduction

Low-quality images are an inevitable, intermittent reality for many real-world computer vision applications. At one extreme, they can be life threatening, such as when they impede the ability of autonomous vehicles [60] and traffic controllers [30] to safely navigate environments. In other cases, they can serve as irritants when they convey a negative impression to viewing audiences, such as on social media or dating websites.

Although low-quality images often emerge in practical settings, there has largely been a disconnect between research aimed at recognizing quality issues and research aimed at performing downstream vision tasks. For researchers focused on uncovering what quality issues are observed in an image, progress has largely grown from artificially constructed settings where they train and evaluate algorithms on publicly available datasets that were constructed by distorting high-quality images to simulate quality issues (e.g., using JPEG compression or Gaussian blur) [41, 47, 12, 21, 37, 36, 25, 31]. Yet, these contrived environments typically lack sufficient sophistication to capture the plethora of factors that contribute to quality issues in natural settings (e.g., camera hardware, lighting, camera shake, scene obstructions). Moreover, the quality issues are decoupled from whether they affect the ability to complete specific vision tasks. As for researchers focusing on specific tasks, much of their progress has developed from environments that lack low-quality images. That is because the creators of popular publicly available datasets that support the development of such algorithms typically included a step to filter out any candidate images deemed to be of insufficient quality for the final dataset [11, 14, 23, 9, 53, 28, 59]. Consequently, such datasets lack data that would enable training algorithms to identify when images are of insufficient quality to complete a given task.

Figure 1: We introduce a new image quality assessment dataset which we call VizWiz-QualityIssues. Shown are examples of the taxonomy of labels, which ranges from no quality issues to six quality flaws to unrecognizable/uncaptionable images. Images can manifest different combinations of the above labels, for instance the unrecognizable image is also labeled as suffering from image blur and poor framing.

Motivated by the aim to tie the assessment of image quality to practical vision tasks, we introduce a new image quality assessment (IQA) dataset that emerges from a real use case. Specifically, our dataset is built around 39,181 images that were taken by people who are blind who were authentically trying to learn about images they took using the VizWiz mobile phone application [5]. Of these images, 17% were submitted to collect image captions from remote humans. The remaining 83% were submitted with a question to collect answers to their visual questions. As discussed in prior work [7, 17], users submitted these images and visual questions (i.e., images with questions) to overcome real visual challenges that they faced in their daily lives. They typically waited nearly two minutes to receive a response from the remote humans [5]. For each image, we asked crowdworkers to either supply a caption describing it or clarify that the quality issues are too severe for them to be able to create a caption. We call this task the unrecognizability classification task. We also ask crowdworkers to label each image with quality flaws that are more traditionally discussed in the literature [7, 12]: blur, overexposure (bright), underexposure (dark), improper framing, obstructions, and rotated views. We call this task the quality flaws classification task. Examples of resulting labeled images in our dataset are shown in Figure 1. Altogether, we call this dataset VizWiz-QualityIssues.

We then demonstrate the value of this new dataset for several new purposes. First, we introduce a novel problem and algorithms for predicting whether an image is sufficient quality to be captioned (Section 4). This can be of immediate use to blind photographers, who otherwise must wait nearly two minutes to learn their image is unsuitable quality for image captioning. We next conduct experiments to demonstrate an additional benefit of this prediction system for creating large-scale image captioning datasets with less wasted human effort (Section 4.3). Finally, we introduce a novel problem and algorithms that inform a user who submits a novel visual question whether it can be answered, cannot be answered because the image content is unrecognizable, or cannot be answered because the image content is missing from the image (Section 5). This too can be of immediate benefit to blind photographers by enabling them to both fail fast and gain valuable insight into how to update the visual question to make it become answerable.

More generally, our work underscores the importance of defining quality within the context of specific tasks. We expect our work can generalize to related vision tasks such as object recognition, scene classification, and video analysis.

2 Related Work

Image Quality Datasets.

A number of image quality datasets exist to support the development of image quality assessment (IQA) algorithms, including LIVE [41, 47], LIVE MD [21], TID2008 [37], TID2013 [36], CSIQ [25], Waterloo Exploration [31], and ESPL-LIVE [24]. A commonality across most such datasets is that they originate from high-quality images that were artificially distorted to introduce image quality issues. For example, LIVE [41, 47] consists of 779 distorted images, which are derived by applying five different types of distortions at numerous distortion levels to 29 high-quality images. Yet, image quality issues that arise in real-world settings look different from those produced by simulating distortions on high-quality images. Accordingly, our work complements recent efforts to create large-scale datasets that flag quality issues in natural images [12]. However, our dataset is considerably larger, offering approximately a 17-fold increase in the number of naturally distorted images; i.e., 20,244 in our dataset versus 1,162 images for [12]. In addition, while [12] assigns a single quality score to each image to capture any of a wide array of image quality issues, our work instead focuses on recognizing the presence of each distinct quality issue and assessing the impact of the quality issues on the real application needs of real users.

Image Quality Assessment.

Our work also relates to the literature that introduces methods for assessing the quality of images. One body of work assumes that developers have access to a high-quality version of each novel image, whether partially or completely. For example, distorted images are evaluated against original, intact images for full-reference IQA algorithms [47, 49, 57, 41, 25, 6, 39] and distorted images are evaluated against partial information about the original, intact images for reduced-reference IQA algorithms [50, 26, 48, 42, 38, 32, 51]. Since our natural setting inherently limits us from having access to original, intact images, our work instead aligns with the second body of work which is built around the assumption that no original, reference image is available; i.e., no-reference IQA (NR-IQA). NR-IQA algorithms instead predict a quality score for each novel image [33, 22, 48, 56, 55, 29, 43, 6, 44]. While many algorithms have been introduced for this purpose, our analysis of five popular NR-IQA models (i.e., BRISQUE [33], NIQE [34], CNN-NRIQA [22], DNN-NRIQA [6], and NIMA [44]) demonstrates that they are inadequate for our novel task of assessing which images are unrecognizable and so cannot be captioned (discussed in Section 4). Accordingly, we introduce new algorithms for this purpose, and demonstrate their advantage.

Efficient Creation of Large-Scale Vision Datasets.

Progress in the vision community has largely been measured and accelerated by the creation of large-scale vision datasets over the past 20 years. Typically, researchers have scraped images for such datasets from online image search databases [11, 14, 23, 9, 53, 28, 59]. In doing so, they typically curate a large collection of high-quality images, since such images first passed uploaders’ assessment that they are of sufficient quality to be shared publicly. In contrast, when employing images captured “in the wild,” it can be a costly, time-consuming process to identify and remove images with unrecognizable content. Accordingly, we quantify the cost of this problem, introduce a novel problem and algorithms for deciphering when image content would be unrecognizable to a human and so should be discarded, and demonstrate the benefit of such solutions for more efficiently creating a large-scale image captioning dataset.

Assistive Technology for Blind Photographers.

Our work relates to the literature about technology for assisting people who are blind to take high-quality pictures [1, 5, 20, 45, 58]. Already, existing solutions can assist photographers in improving the image focus [1], lighting [5], and composition [20, 45, 58]. Additionally, algorithms can inform photographers whether their questions about their images can be answered [17] and why crowds struggle to provide answers [4, 15]. Complementing prior work, we introduce a suite of new AI problems and solutions for offering more fine-grained guidance when alerting blind photographers about what image quality issue(s) are observed. Specifically, we introduce novel problems of (1) recognizing whether image content can be recognized (and so captioned) and (2) deciphering when a question about an image can be answered, cannot be answered because the image content is unrecognizable, or cannot be answered because the content of interest is missing from the image.

3 VizWiz-QualityIssues

We now describe our creation of a large-scale, human-labeled dataset to support the development of algorithms that can assess the quality of images. We focus on a real use case that is prone to image quality issues. Specifically, we build off of 39,181 publicly-available images [16, 17] that originate from blind photographers who each submitted an image with, optionally, a question to the VizWiz mobile phone application [5] in order to receive descriptions of the image from remote humans. Since blind photographers are unable to verify the quality of the images they take, the dataset exemplifies the large diversity of quality issues that occur naturally in practice. We describe below how we create and analyze our new dataset.

3.1 Creation of the Dataset

We scoped our dataset around quality issues that impede people who are blind in their daily lives. Specifically, a clear, resounding message is that people who are blind need assistance in taking images that are sufficiently high-quality that sighted people are able to either describe them or answer questions about them [5, 7].

Quality Issues Taxonomy.

One quality issue label we assess is whether image content is sufficiently recognizable for sighted people to caption the images. We also label numerous quality flaws to situate our work in relation to other papers that similarly focus on assessing image quality issues [7, 12]. Specifically, we include the following categories: blur (is the image blurry?), bright (is the image too bright?), dark (is the image too dark?), obstruction (is the scene obscured by the photographer’s finger over the lens, or another unintended object?), framing (are parts of necessary items missing from the image?), rotation (does the image need to be rotated for proper viewing?), other, and no issues (there are no quality issues in the image).

Image Labeling Task.

To efficiently label all images, we designed our task to run on the crowdsourcing platform Amazon Mechanical Turk. The task interface showed an image on the left half and the instructions with user-entry fields on the right half. First, the crowdworker was instructed to either describe the image in one sentence or click a button to flag the image as being insufficient quality to recognize the content (and so not captionable). When the button was clicked, the image description was automatically populated with the following text: “Quality issues are too severe to recognize the visual content.” Next, the crowdworker was instructed to select all image quality flaws from a pre-defined list that are observed. Shown were the six reasons identified above, as well as Other (OTH) linked to a free-entry text-box so other flaws could be described and None (NON) so crowd workers could specify the image had no quality flaws. The interface enabled workers to adjust their view of the image, using the toolbar to zoom in, zoom out, pan around, or rotate the image if needed. To encourage higher quality results, the interface prevented a user from completing the task until a complete sentence was provided and at least one option from the “image quality flaw” options was chosen. A screen shot of the user interface is shown in the Supplementary Materials.

Crowdsourcing Labels.

To support the collection of high quality labels, we only accepted crowdworkers who previously had completed over 500 HITs with at least a 95% acceptance rate. Also, we collected redundant results. Specifically, we recruited five crowdworkers to label each image. We deemed a label as valid only if at least two crowdworkers chose that label.
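The aggregation rule described above (five annotations per image, with a label kept only if at least two crowdworkers chose it) can be sketched in a few lines of Python. This is an illustration only: the function name `aggregate_labels`, the abbreviations `BLR` and `FRM`, and the example annotations are hypothetical and are not taken from the released code.

```python
from collections import Counter


def aggregate_labels(worker_labels, min_votes=2):
    """Keep a label only if at least `min_votes` workers chose it.

    worker_labels: one set of labels per crowdworker for a single image
    (five sets in the setup described in the text).
    """
    # Each worker contributes at most one vote per label.
    votes = Counter(label for labels in worker_labels for label in set(labels))
    return {label for label, count in votes.items() if count >= min_votes}


# Hypothetical annotations from five crowdworkers for one image.
example = [
    {"BLR", "FRM"},
    {"BLR"},
    {"FRM", "DRK"},
    {"BLR", "FRM"},
    {"NON"},
]
print(sorted(aggregate_labels(example)))
```

Because the threshold is two votes rather than a majority, an image can keep several labels at once, which matches the observation that images can manifest combinations of flaws.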

3.2 Characterization of the Dataset

Prevalence of Quality Issues.

We first examine the frequency at which images taken by people who are blind suffer from the various quality issues to identify the (un)common reasons. To do so, we tally how often unrecognizable images and each quality-flaw arise.

Roughly half of the images suffer from image quality flaws (i.e., the complement of the share labeled NON). We observe that the most common reasons are image blur and inadequate framing. In contrast, only a small portion of the images are labeled as too bright, too dark, having objects obscuring the scene, needing to be rotated for successful viewing, or other reasons. The statistics reveal the most promising directions for improving assistive photography tools and, in turn, blind users’ experiences. Specifically, the main functions should focus on camera shake detection and object detection to mitigate the possibility of taking images with blur or framing flaws.

We also observe that the image quality issues are so severe that image content is deemed unrecognizable for 14.8% of the images. In absolute terms, this means that $3,829 and 379 hours of human annotation were wasted on employing crowdworkers to caption images that contained unrecognizable content. (Footnote 1: Crowdworkers were paid $0.132 for each image and spent an average of 47 seconds captioning each image.) In other words, great savings can be achieved by automatically filtering such uncaptionable images so that they are not sent to crowdworkers. We explore this idea further in Section 4.3.
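The reported totals can be approximately reproduced with a short calculation. The sketch below assumes that the cost counts all five redundant annotations collected per image (Section 3.1); that reading is an inference from the numbers, not something the text states outright, and the rounded 14.8% rate means the result lands within a few dollars of the reported figure rather than matching it exactly.

```python
NUM_IMAGES = 39_181          # images in the dataset
UNRECOGNIZABLE_RATE = 0.148  # share of images deemed unrecognizable
WORKERS_PER_IMAGE = 5        # assumption: every redundant annotation is counted
PAY_PER_TASK = 0.132         # dollars paid per image per crowdworker
SECONDS_PER_TASK = 47        # average time spent captioning one image

# Number of paid annotation tasks spent on unrecognizable images.
wasted_tasks = NUM_IMAGES * UNRECOGNIZABLE_RATE * WORKERS_PER_IMAGE
wasted_dollars = wasted_tasks * PAY_PER_TASK
wasted_hours = wasted_tasks * SECONDS_PER_TASK / 3600

print(f"{wasted_tasks:.0f} tasks, ${wasted_dollars:,.0f}, {wasted_hours:.0f} hours")
```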

Likelihood Image Has Unrecognizable Content Given its Quality-Flaw.

We next examine the probability that an image’s content is unrecognizable conditioned on each of the reasons for quality flaws. Results are shown in Figure 2 (right).

Almost all reasons led to percentages that are larger than the overall percentage of unrecognizable images, which is 14.8% of all images. This demonstrates what we intuitively suspected, which is that images with quality flaws are more likely to have unrecognizable content. We observe that this trend is the strongest for images that suffer from obstructions (OBS) and inadequate lighting (BRT and DRK), with percentages just over 40%.

Interestingly, two categories have percentages that are smaller than the overall percentage of unrecognizable images, which is 14.8% of all images. First, images that are flagged as needing to be rotated for proper viewing (ROT) are deemed unrecognizable less often than images overall. In retrospect, this seems understandable, as the content of images with a rotation flaw could still be recognized if viewers tilt their heads (or apply visual display tools to rotate the images). Second, images labeled with no flaws (NON) are only rarely deemed unrecognizable. This tiny amount aligns with the concept that “unrecognizable” and “no flaws” are two conflicting ideas. Still, the fact that the percentage is not 0% highlights that humans can offer different perspectives. Put differently, the image quality assessment task can be subjective.
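The conditional percentages discussed here are simple ratios over the labeled images. The following minimal sketch shows how such an estimate could be computed; the record layout, the function name `unrecognizable_rate_given_flaw`, the `BLR` and `FRM` abbreviations, and the four toy records are all invented for illustration and do not reflect the dataset's actual file format or statistics.

```python
def unrecognizable_rate_given_flaw(records, flaw):
    """Estimate P(unrecognizable | flaw) from labeled images."""
    with_flaw = [r for r in records if flaw in r["flaws"]]
    if not with_flaw:
        return None  # flaw never observed, so the estimate is undefined
    return sum(r["unrecognizable"] for r in with_flaw) / len(with_flaw)


# Toy records, invented for illustration only.
records = [
    {"flaws": {"BLR"}, "unrecognizable": True},
    {"flaws": {"BLR", "FRM"}, "unrecognizable": False},
    {"flaws": {"ROT"}, "unrecognizable": False},
    {"flaws": {"NON"}, "unrecognizable": False},
]

# Overall rate, used as the reference point for each conditional rate.
overall = sum(r["unrecognizable"] for r in records) / len(records)
print(overall, unrecognizable_rate_given_flaw(records, "BLR"))
```

Comparing each conditional rate with the overall rate is what separates flaws that raise the chance of unrecognizable content from those, such as rotation, that do not.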

Figure 2: Left: Percentage of images with quality flaws given unrecognizability. Right: Percentage of unrecognizable images given quality flaws.
Figure 3: Interrelation of quality flaws. Values are scaled, with each multiplied by 100. The cell at the i-th row and the j-th column shows the interrelation index for the i-th and j-th quality flaws. The diagonal is suppressed for clarity.
Figure 4: Distributions of image quality scores predicted by conventional NR-IQA systems [33, 34, 22, 6, 44] in our new VizWiz-QualityIssues dataset. The heavy overlap of the distributions of scores for recognizable and unrecognizable images reveals that none of the methods are able to distinguish recognizable images from unrecognizable images.

Likelihood Image Has Each Quality-Flaw Given its Content is Unrecognizable.

We next examine the probability that an image manifests each quality flaw given that its content is unrecognizable. Results are shown in Figure 2 (left). Overall, our findings parallel those identified in the “Prevalence of Quality Issues” paragraph. For example, we again observe the most common reasons are blurry images and improper framing (71.2%). Similarly, unrecognizable images are found to be associated less frequently with the other quality flaws.

Relationship Between Quality Flaws in Images.

Finally, we quantify the relationship between all possible pairs of quality flaws. In doing so, we were motivated to provide a measure that offers insight into causality and co-occurrence when comparing any pair of quality flaws, while avoiding measuring joint probabilities. To meet this aim, we introduce a new measure, which we call the interrelation index and define over pairs of quality flaws.

More details about this measure and the motivation for it are provided in the Supplementary Materials. Briefly, larger positive values indicate that the two flaws tend to co-occur, with one causing the other to happen more often. Results are shown in Figure 3.

We observe that almost all quality flaws tend to occur with one another, as shown by the positive values of the interrelation index. At first, we were surprised to observe that there is a relationship between BRT and DRK (i.e., their interrelation index is greater than zero), since these flaws are seemingly incompatible concepts. However, from visual inspection of the data, we found some images indeed suffered from both lighting flaws. We exemplify this and other quality flaw correlations in the Supplementary Materials. From our findings, we also observe that “no flaws” does not co-occur with other quality flaws; i.e., the values in the grid are all negative for the row and column for NON. This finding aligns with our intuition that an image labeled with NON is less likely to have a quality flaw at the same time.

4 Classifying Unrecognizable Images

A widespread assumption when captioning images is that the image quality is good enough to recognize the image content. Yet, people who are blind cannot verify the quality of the images they take and it is known their images can be very poor in quality [5, 7, 17]. Accordingly, we now examine the benefit of our large-scale quality dataset for training algorithms to detect when images are unrecognizable and so not captionable.

4.1 Motivation: Inadequate Existing Methods

Before exploring novel algorithms, it is important to first check whether existing methods are suitable for our purposes. Accordingly, we check whether related NR-IQA systems can detect when images are unrecognizable. To do so, we apply five NR-IQA methods on the complete VizWiz-QualityIssues dataset: BRISQUE [33], NIQE [34], CNN-NRIQA [22], DNN-NRIQA [6], and NIMA [44]. The first two are popular conventional methods that rely on hand-crafted features. The last three are based on neural networks and trained on IQA datasets mentioned in Section 2. For example, DNN-NRIQA-TID and DNN-NRIQA-LIVE in Figure 4 are trained on the TID dataset and LIVE dataset, respectively. Intuitively, if the algorithms are effective for this task, we would expect that the scores for recognizable images are distributed mostly in the high-score region, while the scores for unrecognizable images are distributed mostly in the low-score region.

Results are shown in Figure 4. A key finding is that the distributions of scores for recognizable and unrecognizable images heavily overlap. That is, none of the methods can distinguish recognizable images from unrecognizable images in our dataset. This finding shows that existing methods trained on existing datasets (i.e., LIVE, TID, CSIQ) are unsuitable for our novel task on the VizWiz-QualityIssues dataset. This is possibly in part because quality issues resulting from artificial distortions, such as compression, Gaussian blur, and additive Gaussian noise, differ from natural distortions triggered by poor camera focus, lighting, framing, etc. This also may be because there is no 1-1 mapping between scores indicating overall image quality and our proposed task, since an image with a low quality score may still have recognizable content.

4.2 Proposed Algorithm

Having observed that existing IQA methods are inadequate for our problem, we now introduce models for our novel task of assessing whether an image is recognizable.


We use ResNet-152 [18] to extract image features, which are then processed by 2-dimensional global pooling followed by two fully connected layers. The final layer is a single neuron with a sigmoid activation function. (Footnote 2: Due to space constraints, we demonstrate the effectiveness of this architecture for assessing the quality flaws in the Supplementary Materials. The primary difference for that architecture is that we replace ResNet-152 with XceptionNet [8] and use three fully connected layers, with a final layer of eight neurons with eight sigmoid functions.)

We train this algorithm using an Adam optimizer with the learning rate set to 0.001 for 8 epochs. We fix the ResNet weights pre-trained on ImageNet [9] and only learn the weights in the two fully connected layers.
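To make the shape of the trainable part concrete, here is a dependency-free sketch of the forward pass through the classification head: global pooling over a backbone feature map, a hidden fully connected layer, and a single sigmoid neuron. Several details are assumptions rather than facts from the text: average pooling (the text says only "global pooling"), a ReLU between the two layers, the hidden width, and the random weights. The tiny feature map stands in for the frozen backbone output, which in ResNet-152 has 2048 channels.

```python
import math
import random


def global_average_pool(feature_map):
    """Collapse a C x H x W feature map to a length-C vector."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature_map]


def linear(vector, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unrecognizability_head(feature_map, params):
    """Pooled features -> hidden layer -> single sigmoid neuron."""
    pooled = global_average_pool(feature_map)
    # ReLU between the two layers is an assumption, not stated in the text.
    hidden = [max(0.0, h) for h in linear(pooled, params["w1"], params["b1"])]
    logit = linear(hidden, params["w2"], params["b2"])[0]
    return sigmoid(logit)


rng = random.Random(0)
C, H, W, HIDDEN = 4, 2, 2, 3  # toy sizes; real backbone features are far larger
feature_map = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
params = {
    "w1": [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(HIDDEN)],
    "b1": [0.0] * HIDDEN,
    "w2": [[rng.uniform(-1, 1) for _ in range(HIDDEN)]],
    "b2": [0.0],
}
score = unrecognizability_head(feature_map, params)
print(f"P(unrecognizable) = {score:.3f}")
```

Only `params` would be updated during training in this picture; the feature map comes from the frozen, ImageNet-pre-trained backbone.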

Dataset Splits.

For training and evaluation of our algorithm, we apply a 52.5%/37.5%/10% split to our dataset to create the training, validation, and test splits.
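A split with these proportions could be produced as sketched below. The text does not say how images are assigned to each split, so the seeded random shuffle, the function name `split_dataset`, and the use of integer indices in place of image identifiers are all assumptions made for illustration.

```python
import random


def split_dataset(items, fractions=(0.525, 0.375, 0.10), seed=0):
    """Shuffle `items` and cut them into train/val/test by `fractions`."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = round(fractions[0] * len(items))
    n_val = round(fractions[1] * len(items))
    # The test split takes whatever remains, so no item is dropped.
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


train, val, test = split_dataset(range(39_181))
print(len(train), len(val), len(test))
```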

| Method | Avg. precision | Recall | F1 |
|---|---|---|---|
| ResNet-152 | 80.0 | 75.1 | 71.2 |
| Random guessing | 16.6 | 14.6 | 15.5 |
| SIFT | 87.2 | 42.3 | 56.9 |
| HOG + linear SVM | 56.4 | 41.2 | 47.6 |

Table 1: Performance of algorithms in assessing whether image content can be recognized (and so captioned).
| Algorithm | Training set | B@1 | B@2 | B@3 | B@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |
|---|---|---|---|---|---|---|---|---|---|
| AoANet [19] | full training set | 63.3 | 44.3 | 29.9 | 19.7 | 18.0 | 44.4 | 43.6 | 11.2 |
| AoANet [19] | perfect flag | 63.3 | 43.8 | 29.5 | 19.9 | 18.1 | 44.2 | 43.6 | 11.5 |
| AoANet [19] | predicted flag | 63.2 | 44.0 | 29.5 | 19.8 | 18.1 | 44.2 | 42.9 | 11.5 |
| AoANet [19] | random sample | 62.5 | 43.3 | 28.8 | 18.9 | 18.0 | 44.1 | 41.9 | 11.4 |
| SGAE [54] | full training set | 62.8 | 43.3 | 28.6 | 18.8 | 17.3 | 44.0 | 32.4 | 10.4 |
| SGAE [54] | perfect flag | 63.0 | 43.1 | 28.6 | 18.9 | 17.2 | 43.9 | 32.5 | 10.3 |
| SGAE [54] | predicted flag | 63.1 | 43.1 | 28.4 | 18.7 | 17.2 | 44.0 | 32.4 | 10.4 |
| SGAE [54] | random sample | 62.4 | 42.7 | 27.9 | 18.2 | 17.1 | 43.7 | 30.4 | 10.4 |

Table 2: Performance of two image captioning algorithms with respect to eight metrics when trained on the full captioning-training-set, training images annotated to be recognizable (perfect flag), training images predicted to be recognizable (predicted flag), and a subset randomly sampled from the captioning-training-set. (B@n = BLEU-n)


We compare our algorithm to numerous baselines. Included is random guessing, which flags an image as unrecognizable at random with a fixed probability. We also analyze a linear SVM that predicts from scale-invariant feature transform (SIFT) features. Intuitively, a low-quality image should have few or no key points. We also evaluate a linear SVM that predicts from histogram of oriented gradients (HOG) features.

Evaluation Metrics.

We evaluate each method using average precision, recall, and F1 scores. Accuracy is excluded because the distribution of unrecognizability labels is heavily skewed toward “false,” and such imbalanced data suffer from the accuracy paradox.
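The accuracy paradox mentioned here is easy to see with a small worked example. The sketch below uses 1,000 hypothetical images with the dataset's overall rate of unrecognizable content (14.8%) and a trivial classifier that never flags anything. The helper `precision_recall_f1` is illustrative; it computes plain precision rather than average precision, which would require ranked prediction scores.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 for the positive (True) class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# 1,000 hypothetical images, 148 of them unrecognizable (14.8%).
y_true = [True] * 148 + [False] * 852
always_recognizable = [False] * 1000  # trivial classifier: never flags an image

accuracy = sum(t == p for t, p in zip(y_true, always_recognizable)) / len(y_true)
print(accuracy)
print(precision_recall_f1(y_true, always_recognizable))
```

The trivial classifier scores well on accuracy while detecting no unrecognizable images at all, which is why recall and F1 are the more informative measures in this setting.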


Results are shown in Table 1. We observe that both SIFT and HOG are much stronger baselines than random guessing and get high scores on precision, especially for SIFT. However, they both get low scores on recall. This means that SIFT and HOG are good at capturing a subset of unrecognizable images but still miss many others. On the other hand, the ResNet model gets much higher recall scores while maintaining decent average precision scores, implying that it is more effective at learning the characteristics of unrecognizable images. (Footnote 3: Again, due to space constraints, results showing prediction performance for quality flaw classification are in the Supplementary Materials.) This is exciting since such an algorithm can be of immediate use to blind photographers, who otherwise must wait nearly two minutes to learn that their image is of unsuitable quality for image captioning.

4.3 Application: Efficient Dataset Creation

We now examine another potential benefit of our algorithm in helping to create a large scale training dataset.

To support this effort, we divide the dataset into three sets. One set is used to train our image unrecognizability algorithm. A second set is used to train our image captioning algorithms, which we call the captioning-training-set. The third set is used to evaluate our image captioning algorithms, which we call the captioning-evaluation-set.

We use our method to identify which images in the captioning-training-set to use for training image captioning algorithms. In particular, the images flagged as recognizable are included and the remaining images are excluded. We compare this method to three baselines, specifically training on: all images in the captioning-training-set, a random sample of images in the captioning-training-set, a perfect sample of images in the captioning-training-set that are known to be recognizable images.

We evaluate two state-of-the-art image captioning algorithms, AoANet [19] and SGAE [54], trained independently on each training set, with respect to eight evaluation metrics: BLEU-1 through BLEU-4 [35], METEOR [10], ROUGE-L [27], CIDEr-D [46], and SPICE [2].

Results are shown in Table 2. Our method performs comparably to when the algorithms were trained on all images as well as the perfect set. In contrast, our method yields improved results over the random sample. Altogether, these findings offer promising evidence that our prediction system is successfully retaining meaningful images while removing images that are not informative for the captioning task (i.e., unrecognizable). This reveals that a benefit of using the recognizability prediction system is to save time and money when crowdsourcing captions (by first removing unrecognizable images), without diminishing the performance of downstream trained image captioning algorithms.

5 Recognizing Unanswerable Visual Questions

The visual question “answerability” problem is to decide whether a visual question can be answered [17]. Yet, as exemplified in Figure 5, visual questions can be unanswerable because the image is unrecognizable or because the answer to the question is missing in a recognizable image. Towards enabling more fine-grained guidance to photographers regarding how to modify the visual question so it is answerable, we move beyond predicting whether a visual question is unanswerable [17] and introduce a novel problem of predicting why a visual question is unanswerable.

Figure 5: Examples of visual questions that are unanswerable for two reasons. The left two examples have unrecognizable images while the right two examples have recognizable images but the content of interest is missing from the field of view. Our posed algorithm correctly predicts why visual questions are unanswerable for these examples.

5.1 Motivation

We extend the VizWiz-VQA dataset [17], which labels each image-question pair as answerable or unanswerable. We inspect how answerability relates to recognizability and each quality flaw. For convenience, we use the following notation: $A$: answerable, $\bar{A}$: unanswerable, $R$: recognizable, $\bar{R}$: unrecognizable, $Q$: quality issues, and $P(\cdot)$: probability function. Results are shown in Figure 6. We observe that for most quality flaws $Q$, $P(\bar{A} \mid Q)$ is larger than $P(\bar{A})$, and the probability of an unanswerable question increases further when the image is unrecognizable, i.e., for $P(\bar{A} \mid \bar{R})$. Additionally, the probability that an image is unrecognizable increases from $P(\bar{R})$ to $P(\bar{R} \mid \bar{A})$ when questions are known to be unanswerable. Observing that a large reason for unanswerable questions is that the images are unrecognizable, we are motivated to equip VQA systems with a function that is able to clarify why a user's question is unanswerable.

Figure 6: Top: Fractions of unanswerable questions conditioned on unrecognizability or a quality flaw. Bottom: Fractions of quality issues and unrecognizable images given answerability. Values are scaled by being multiplied with 100.

5.2 Proposed Algorithm


Our algorithm extends the Up-Down VQA model [3]. It takes as input encoded image features and a paired question. Image features could be grid-level features extracted by ResNet-152 [18] as well as object-level features extracted by Faster-RCNN [40] or Detectron [13, 52]. The input question is first encoded by a GRU cell. Then, a top-down attention module computes a weighted image feature from the encoded question representation and the input image features. The image and question features are coupled by element-wise multiplication. This coupled feature is processed by the prediction module to predict answerability and recognizability. We employ two different activation functions at the end of the model to make the final prediction. The first one is softmax, which predicts three exclusive classes: answerable, unrecognizable, and insufficient content information (answers cannot be found in images). The second activation function is two independent sigmoids, one for answerability and the other for recognizability. We train the network using an Adam optimizer with a learning rate of 0.001, only for the layers after feature extraction.
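The difference between the two output designs can be illustrated with a small sketch of the final activations. The logits below are hypothetical stand-ins for the prediction module's outputs, and the `explain` function, which maps the two sigmoid outputs to one of the three outcomes using a 0.5 threshold, is an assumed decision rule rather than one described in the text.

```python
import math

CLASSES = ("answerable", "unrecognizable", "insufficient content")


def softmax(logits):
    shifted = [z - max(logits) for z in logits]  # subtract max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax_head(logits):
    """Three mutually exclusive classes; probabilities sum to 1."""
    return dict(zip(CLASSES, softmax(logits)))


def sigmoid_head(answerable_logit, recognizable_logit):
    """Two independent probabilities, one per binary decision."""
    return {"answerable": sigmoid(answerable_logit),
            "recognizable": sigmoid(recognizable_logit)}


def explain(sig, threshold=0.5):
    """Map the two sigmoid outputs to one of the three outcomes (assumed rule)."""
    if sig["answerable"] >= threshold:
        return "answerable"
    if sig["recognizable"] < threshold:
        return "unrecognizable"
    return "insufficient content"


# Hypothetical logits for a single visual question.
print(softmax_head([0.2, 1.5, -0.3]))
print(explain(sigmoid_head(-2.0, -1.0)))
print(explain(sigmoid_head(-2.0, 3.0)))
```

The softmax variant forces a single explanation per visual question, whereas the sigmoid variant scores answerability and recognizability separately and leaves the combination to a downstream rule.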

Dataset Splits.

We split the VizWiz dataset into training/validation/test sets according to a 70%/20%/10% ratio.
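A split with these proportions could be produced as in the sketch below. It assumes a uniformly random partition with a fixed seed, which is an assumption: the text only specifies the ratio. The helper name `split_indices` is hypothetical.

```python
import numpy as np

def split_indices(n_items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Randomly partition range(n_items) into train/validation/test index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = int(round(ratios[0] * n_items))
    n_val = int(round(ratios[1] * n_items))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]   # remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```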

Evaluation Metrics.

We evaluate performance using average precision, precision, recall, and F1 scores, for which a simple threshold is used to binarize probability values. For inter-model comparisons, we also report the precision-recall curve for each variant.
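These metrics can be computed as in the following sketch, which assumes NumPy and a threshold of 0.5. The threshold value is an assumption of the sketch, and the average precision shown is the common ranking-based form (mean precision at the rank of each positive) without tie handling, which may differ in detail from the evaluation code used for the reported numbers.

```python
import numpy as np

def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Binarize probabilities at `threshold`, then compute precision, recall, F1."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_prob) >= threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return float(precision), float(recall), float(f1)

def average_precision(y_true, y_prob):
    """Mean of the precision values measured at the rank of each positive example."""
    y_true = np.asarray(y_true, dtype=bool)
    order = np.argsort(-np.asarray(y_prob), kind="stable")  # highest score first
    hits = y_true[order]
    if hits.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())

# Toy predictions (hypothetical).
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
p, r, f1 = precision_recall_f1(y_true, y_prob)
ap = average_precision(y_true, y_prob)
print(p, r, f1, ap)
```

Unlike precision, recall, and F1, average precision does not depend on the threshold, which is why it is the natural metric for comparing models that output probabilities.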


For comparison, we consider a number of baselines. One is the original model for predicting whether a visual question is answerable [17], which also employs a top-down attention model. We also evaluate the random guessing, SIFT, and HOG baselines used to evaluate the recognizability algorithms in the previous section.

| Method | Unans AP | Unans Rec | Unans F1 | Unrec given unans AP | Unrec given unans Rec | Unrec given unans F1 |
|---|---|---|---|---|---|---|
| [17] | 71.7 | - | 64.8 | - | - | - |
| Rand guess | - | - | - | 31.1* | 14.8 | 20.0 |
| SIFT | - | - | - | 94.9* | 45.3 | 61.3 |
| HOG | - | - | - | 73.1* | 44.9 | 55.7 |
| TD+soft | 72.6 | 77.3 | 67.0 | 82.2 | 79.3 | 75.0 |
| TD+sigm | 73.6 | 71.2 | 68.0 | 86.6 | 79.3 | 78.6 |
| BU+sigm | 73.0 | 66.6 | 66.7 | 87.4 | 73.7 | 78.7 |
| TD+BU+sigm | 74.0 | 82.3 | 67.9 | 87.7 | 79.3 | 79.7 |
| sigm w/o att. | 67.7 | 66.1 | 64.2 | 86.7 | 66.7 | 74.2 |

  • TD: top-down attention. BU: bottom-up attention. soft: softmax. sigm: sigmoid. att: attention. AP: average precision. Rec: recall. Unrec: Unrecognizable. Unans: Unanswerable.

  • *: Precision is calculated, since true or false is predicted instead of a probability.

Table 3: Performance of predicting why a visual question is unanswerable: unrecognizable image versus unanswerable because the content of interest is missing from the field of view. [17] only predicts answerability and serves as the baseline for unanswerability prediction. Random guessing, SIFT, and HOG only predict recognizability and serve as the baselines for unrecognizability prediction.


Results are shown in Table 3 and Figure 7. Our models perform comparably to the answerability baseline [17]. This is exciting because it shows that jointly learning to predict answerability and recognizability does not degrade performance; in fact, the TD+softmax and TD+sigmoid models achieve higher average precision than the baseline [17] (72.6 and 73.6 versus 71.7) as well as higher F1 scores (67.0 and 68.0 versus 64.8).

Our results also highlight the importance of learning to predict answerability jointly with recognizability (i.e., rows 5–9) over relying on more basic baselines (i.e., rows 2–4). As shown in Table 3, the low recall values imply that SIFT and HOG fail to capture many unrecognizable images, while our models learn image features and excel in recall and F1 scores.

Figure 7: Precision-recall curves for five algorithms predicting unrecognizability when questions are unanswerable.

Next, we compare the results from TD+softmax and TD+sigmoid. We observe that they are comparable in unanswerability prediction, with similar average precision and F1 scores. For unrecognizability prediction, TD+softmax is a bit weaker than TD+sigmoid due to slightly lower average precision and F1 scores. One reason for this may be the manual assignment of unrecognizability to false when answerability is true. Originally, of images are unrecognizable, but after assignment, the portion drops to . Learning from more highly biased data is a harder task, which could in part explain the weaker performance of the TD+softmax model.
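The effect of this label assignment can be illustrated with a small sketch. It assumes NumPy and a class ordering (0: answerable, 1: unrecognizable, 2: insufficient content information) chosen here for illustration; the helper name and the toy labels are hypothetical.

```python
import numpy as np

ANSWERABLE, UNRECOGNIZABLE, INSUFFICIENT_CONTENT = 0, 1, 2

def softmax_targets(answerable, unrecognizable):
    """Collapse two binary labels into one of three mutually exclusive classes."""
    answerable = np.asarray(answerable, dtype=bool)
    unrecognizable = np.asarray(unrecognizable, dtype=bool)
    targets = np.full(answerable.shape, INSUFFICIENT_CONTENT)
    targets[unrecognizable] = UNRECOGNIZABLE
    # Answerable takes priority: unrecognizability is set to false when answerable.
    targets[answerable] = ANSWERABLE
    return targets

# Toy labels for six visual questions (hypothetical).
answerable = np.array([1, 1, 0, 0, 0, 1], dtype=bool)
unrecognizable = np.array([0, 1, 1, 1, 0, 0], dtype=bool)

targets = softmax_targets(answerable, unrecognizable)
frac_before = unrecognizable.mean()                 # share of unrecognizable labels
frac_after = (targets == UNRECOGNIZABLE).mean()     # share left after assignment
print(targets.tolist(), frac_before, frac_after)
```

The second example is both answerable and unrecognizable, so its unrecognizable label is lost under the exclusive-class scheme, which shrinks the positive class for unrecognizability. The two independent sigmoids do not require this collapse.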

6 Conclusions

We introduce a new image quality assessment dataset that emerges from an authentic use case where people who are blind struggle to capture high-quality images towards learning about their visual surroundings. We demonstrate the potential of this dataset to encourage the development of new algorithms that can support real users trying to obtain image captions and answers to their visual questions. The dataset and all code are publicly available at


We gratefully acknowledge funding support from the National Science Foundation (IIS-1755593), Microsoft, and Amazon. We thank Nilavra Bhattacharya and the crowdworkers for their valuable contributions to creating the new dataset.


  • [1] . Note: Cited by: §2.
  • [2] P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382–398. Cited by: §4.3.
  • [3] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086. Cited by: §5.2.
  • [4] N. Bhattacharya, Q. Li, and D. Gurari (2019) Why does a visual question have different answers?. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4271–4280. Cited by: §2.
  • [5] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, et al. (2010) VizWiz: nearly real-time answers to visual questions. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, pp. 333–342. Cited by: §1, §2, §3.1, §3, §4.
  • [6] S. Bosse, D. Maniry, K. Müller, T. Wiegand, and W. Samek (2017) Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing 27 (1), pp. 206–219. Cited by: §2, Figure 4, §4.1.
  • [7] E. Brady, M. R. Morris, Y. Zhong, S. White, and J. P. Bigham (2013) Visual challenges in the everyday lives of blind people. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2117–2126. Cited by: §1, §3.1, §3.1, §4.
  • [8] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258. Cited by: footnote 2.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §2, §4.2.
  • [10] M. Denkowski and A. Lavie (2014) Meteor universal: language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, Cited by: §4.3.
  • [11] L. Fei-Fei, R. Fergus, and P. Perona (2004) Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pp. 178–178. Cited by: §1, §2.
  • [12] D. Ghadiyaram and A. C. Bovik (2015) Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing 25 (1), pp. 372–387. Cited by: §1, §1, §2, §3.1.
  • [13] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He (2018) Detectron. Note: Cited by: §5.2.
  • [14] G. Griffin, A. Holub, and P. Perona (2007) Caltech-256 object category dataset. Cited by: §1, §2.
  • [15] D. Gurari and K. Grauman (2017) CrowdVerge: predicting if people will agree on the answer to a visual question. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 3511–3522. Cited by: §2.
  • [16] D. Gurari, Q. Li, C. Lin, Y. Zhao, A. Guo, A. Stangl, and J. P. Bigham (2019) VizWiz-priv: a dataset for recognizing the presence and purpose of private visual information in images taken by blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 939–948. Cited by: §3.
  • [17] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham (2018) Vizwiz grand challenge: answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3608–3617. Cited by: §1, §2, §3, §4, §5.1, §5.2, §5.2, Table 3, §5.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.2, §5.2.
  • [19] L. Huang, W. Wang, J. Chen, and X. Wei (2019) Attention on attention for image captioning. In International Conference on Computer Vision, Cited by: Table 2.
  • [20] C. Jayant, H. Ji, S. White, and J. P. Bigham (2011) Supporting blind photography. In The proceedings of the 13th international ACM SIGACCESS conference on Computers and accessibility, pp. 203–210. Cited by: §2.
  • [21] D. Jayaraman, A. Mittal, A. K. Moorthy, and A. C. Bovik (2012) Objective quality assessment of multiply distorted images. In 2012 Conference record of the forty sixth asilomar conference on signals, systems and computers (ASILOMAR), pp. 1693–1697. Cited by: §1, §2.
  • [22] L. Kang, P. Ye, Y. Li, and D. Doermann (2014) Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1733–1740. Cited by: §2, Figure 4, §4.1.
  • [23] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §1, §2.
  • [24] D. Kundu, D. Ghadiyaram, A. Bovik, and B. Evans (2017) Large-scale crowdsourced study for high dynamic range images. IEEE Trans. Image Process. 26 (10), pp. 4725–4740. Cited by: §2.
  • [25] E. C. Larson and D. M. Chandler (2010) Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging 19 (1), pp. 011006. Cited by: §1, §2, §2.
  • [26] Q. Li and Z. Wang (2009) Reduced-reference image quality assessment using divisive normalization-based image representation. IEEE journal of selected topics in signal processing 3 (2), pp. 202–211. Cited by: §2.
  • [27] C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §4.3.
  • [28] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1, §2.
  • [29] L. Liu, B. Liu, H. Huang, and A. C. Bovik (2014) No-reference image quality assessment based on spatial and spectral entropies. Signal Processing: Image Communication 29 (8), pp. 856–863. Cited by: §2.
  • [30] Y. Lou, Y. Bai, J. Liu, S. Wang, and L. Duan (2019) Veri-wild: a large dataset and a new method for vehicle re-identification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3235–3243. Cited by: §1.
  • [31] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang (2016) Waterloo exploration database: new challenges for image quality assessment models. IEEE Transactions on Image Processing 26 (2), pp. 1004–1016. Cited by: §1, §2.
  • [32] L. Ma, S. Li, F. Zhang, and K. N. Ngan (2011) Reduced-reference image quality assessment using reorganized dct-based image representation. IEEE Transactions on Multimedia 13 (4), pp. 824–829. Cited by: §2.
  • [33] A. Mittal, A. K. Moorthy, and A. C. Bovik (2012) No-reference image quality assessment in the spatial domain. IEEE Transactions on image processing 21 (12), pp. 4695–4708. Cited by: §2, Figure 4, §4.1.
  • [34] A. Mittal, R. Soundararajan, and A. C. Bovik (2012) Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters 20 (3), pp. 209–212. Cited by: §2, Figure 4, §4.1.
  • [35] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Cited by: §4.3.
  • [36] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, et al. (2015) Image database tid2013: peculiarities, results and perspectives. Signal Processing: Image Communication 30, pp. 57–77. Cited by: §1, §2.
  • [37] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti (2009) TID2008-a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics 10 (4), pp. 30–45. Cited by: §1, §2.
  • [38] A. Rehman and Z. Wang (2012) Reduced-reference image quality assessment by structural similarity estimation. IEEE Transactions on Image Processing 21 (8), pp. 3378–3389. Cited by: §2.
  • [39] R. Reisenhofer, S. Bosse, G. Kutyniok, and T. Wiegand (2018) A haar wavelet-based perceptual similarity index for image quality assessment. Signal Processing: Image Communication 61, pp. 33–43. Cited by: §2.
  • [40] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §5.2.
  • [41] H. R. Sheikh, M. F. Sabir, and A. C. Bovik (2006) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing 15 (11), pp. 3440–3451. Cited by: §1, §2, §2.
  • [42] R. Soundararajan and A. C. Bovik (2011) RRED indices: reduced reference entropic differencing for image quality assessment. IEEE Transactions on Image Processing 21 (2), pp. 517–526. Cited by: §2.
  • [43] S. Suresh, R. V. Babu, and H. J. Kim (2009) No-reference image quality assessment using modified extreme learning machine classifier. Applied Soft Computing 9 (2), pp. 541–552. Cited by: §2.
  • [44] H. Talebi and P. Milanfar (2018) Nima: neural image assessment. IEEE Transactions on Image Processing 27 (8), pp. 3998–4011. Cited by: §2, Figure 4, §4.1.
  • [45] M. Vázquez and A. Steinfeld (2014) An assisted photography framework to help visually impaired users properly aim a camera. ACM Transactions on Computer-Human Interaction (TOCHI) 21 (5), pp. 25. Cited by: §2.
  • [46] R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575. Cited by: §4.3.
  • [47] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §1, §2, §2.
  • [48] Z. Wang and A. C. Bovik (2011) Reduced-and no-reference image quality assessment. IEEE Signal Processing Magazine 28 (6), pp. 29–40. Cited by: §2.
  • [49] Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, pp. 1398–1402. Cited by: §2.
  • [50] Z. Wang and E. P. Simoncelli (2005) Reduced-reference image quality assessment using a wavelet-domain natural image statistic model. In Human Vision and Electronic Imaging X, Vol. 5666, pp. 149–159. Cited by: §2.
  • [51] J. Wu, W. Lin, G. Shi, and A. Liu (2013) Reduced-reference image quality assessment with visual information fidelity. IEEE Transactions on Multimedia 15 (7), pp. 1700–1705. Cited by: §2.
  • [52] Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Note: Cited by: §5.2.
  • [53] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) Sun database: large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492. Cited by: §1, §2.
  • [54] X. Yang, K. Tang, H. Zhang, and J. Cai (2019) Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10685–10694. Cited by: Table 2.
  • [55] P. Ye and D. Doermann (2012) No-reference image quality assessment using visual codebooks. IEEE Transactions on Image Processing 21 (7), pp. 3129–3138. Cited by: §2.
  • [56] P. Ye, J. Kumar, L. Kang, and D. Doermann (2012) Unsupervised feature learning framework for no-reference image quality assessment. In 2012 IEEE conference on computer vision and pattern recognition, pp. 1098–1105. Cited by: §2.
  • [57] L. Zhang, L. Zhang, X. Mou, and D. Zhang (2011) FSIM: a feature similarity index for image quality assessment. IEEE transactions on Image Processing 20 (8), pp. 2378–2386. Cited by: §2.
  • [58] Y. Zhong, P. J. Garrigues, and J. P. Bigham (2013) Real time object scanning using a mobile phone and cloud-based visual search engine. In Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 20. Cited by: §2.
  • [59] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2.
  • [60] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu (2016) Traffic-sign detection and classification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118. Cited by: §1.


This document supplements Sections 3 and 4 of the main paper. In particular, it includes the following:

  • Details and motivation for quality flaw interrelation index (supplements Section 3.2).

  • Result of quality flaw prediction (supplements Section 4.2).

  • Figures illustrating the crowdsourcing interface used to curate our labels (supplements Section 3.1), diversity of resulting unrecognizable images (supplements Section 3.2), performance of our prediction system in classifying unrecognizable images (supplements Section 4.2), and performance of the prediction of the reason for unanswerable questions (supplements Section 5.2).

  • Clarification about baselines used for Section 4.3.

7 Quality flaw interrelation index

Details and motivation

The most straightforward way to explore the relation between two quality flaws A and B is to look at their co-occurrence, i.e., their joint probability . However, cannot really capture the interrelation between quality flaws. For instance, we cannot say that the relation between DRK and FRM is stronger than the one between DRK and OBS simply because of . The reason for is actually due to but has nothing to do with the interrelation of quality flaws.

Consequently, we introduce a new measure which we call interrelation index , which is defined as follows:


There are several advantages of this measure:

  1. It measures causality from A to B: we can show that if and are both greater than zero, either or holds. Therefore, if , then the presence of A must make B occur more often (i.e., ) and the absence of A must make B occur less often (i.e., ), and vice versa.

  2. It measures co-occurrence of A and B: we can show that if , then (it is also true for sign). Hence, we have . In other words, if A makes B happen more often, then B must make A happen more often as well, and vice versa.

  3. It avoids the aforementioned problem of using joint probability. That is, if , it is very likely . However, which of the values of and is greater, and by how much, cannot be told from .
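One measure that has the three properties listed above is the difference of conditional probabilities, P(B | A) minus P(B | not A). Treating this as the interrelation index is an assumption of the sketch below, not a confirmed definition; the function name and the toy labels are hypothetical. The sketch assumes NumPy and per-image boolean flaw labels.

```python
import numpy as np

def interrelation_index(a, b):
    """Assumed form: P(B | A) - P(B | not A), estimated from boolean label arrays."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    if a.all() or not a.any():
        raise ValueError("A must be present in some images and absent in others")
    return float(b[a].mean() - b[~a].mean())

# Toy labels for ten images (hypothetical): flaw A and flaw B.
a = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
b = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0], dtype=bool)

i_ab = interrelation_index(a, b)
i_ba = interrelation_index(b, a)
print(i_ab, i_ba)

# Property 1: a positive index implies P(B | A) > P(B) > P(B | not A).
print(b[a].mean(), b.mean(), b[~a].mean())
```

Under this assumed form, the first property follows because P(B) is a weighted average of P(B | A) and P(B | not A), and the second because the sign of the index equals the sign of P(A, B) - P(A)P(B), which is symmetric in A and B.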

Co-occurrence of DRK and BRT.

Since the values and are both greater than zero, the quality flaws DRK and BRT tend to co-occur despite being contradictory concepts. The examples of such images in Figure 9 explain why this happens: when blind people take pictures in places with poor lighting, they are not aware that the flashlights on mobile devices are turned on automatically, and therefore the pictures taken usually show dark surroundings and a bright spot. Note that this phenomenon is not captured by the joint probability of DRK and BRT, since is an extremely small value and so does not stand out.

Co-occurrence of quality flaws.

We exemplify the co-occurrence of other pairs of quality flaws in Figure 10.

Xception precision 72.9 80.1 62.9 58.5 53.6 77.0 72.6 60.0
recall 79.0 80.1 49.8 57.3 39.7 82.4 69.8 9.1
f1 score 75.8 80.1 55.6 57.9 45.6 79.6 71.2 15.8
Random guessing precision 48.6 40.5 4.9 7.2 4.0 55.0 15.6 0.0
recall 50.5 40.3 4.3 6.7 4.0 54.3 15.7 0.0
f1 score 49.5 40.4 4.5 7.0 4.0 54.6 15.6 0.0
Table 4: Performance of quality flaw prediction
Figure 8: Unrecognizable images due to different quality flaws.
Figure 9: Examples of images that are both too dark and too bright. Note that both recognizable and unrecognizable images can appear here, since quality flaws do not necessarily render an image unrecognizable.
Figure 10: Examples of the co-occurrence of all quality flaw pairs. Again, we observe that both recognizable and unrecognizable images appear, since quality flaws do not necessarily render an image unrecognizable.

8 Quality flaw prediction

Performance of quality flaw classification is shown in Table 4. The Xception model outperforms the random guessing baseline for each quality flaw with respect to precision, recall, and F1 score. Furthermore, Xception performs much better on the NON, BLR, and FRM flaws, which account for large portions of the dataset. On the other hand, quality flaws that represent small portions of the dataset offer few training examples, and so learning to predict them is harder. In the extreme case of OTH, with it representing of the data, the Xception model yields very poor scores of 9.1 and 15.8 for recall and F1 score, respectively.

9 Miscellaneous

  • Figure 8 illustrates the diversity of unrecognizable images that can arise from different quality flaws.

  • Figure 11 shows a screenshot of the crowdsourcing interface used to collect the labels for the dataset.

  • Figure 12 shows examples of unrecognizability prediction by the Xception model.

  • Figure 13 shows examples of the prediction of the reason for unanswerable questions. The prediction model used is the “TD+sigmoid” model.

10 Section 4.3: Clarification about Baselines

The two baselines, “random flag” and “perfect flag”, use the same number of images from the captioning-training-set as our method for algorithm training. That count is determined by our predictor: it is the number of images that remain after removing all images that are deemed to be unrecognizable. “Random flag” chooses a random sample from the captioning-training-set. “Perfect flag” chooses images according to a ranking by how many crowdworkers flagged each image as unrecognizable, with selection starting from those that all five crowdworkers agreed are unrecognizable.

Figure 11: Interface used to crowdsource the collection of image captions.
Figure 12: Examples of true-positives (TP), true-negatives (TN), false-positives (FP), and false-negatives (FN) in unrecognizability prediction. TP: unrecognizable images predicted to be unrecognizable. TN: recognizable images predicted to be recognizable. FP: recognizable images predicted to be unrecognizable. FN: unrecognizable images predicted to be recognizable.
Figure 13: Prediction of the reason for unanswerable questions. Note that each visual question pair here is unanswerable. (a) Unanswerable questions are due to unrecognizable images and so are predictions. (b) Unanswerable questions are due to insufficient content and so are predictions. (c) Unanswerable questions are due to insufficient content but predicted to be due to unrecognizable images. (d) Unanswerable questions are due to unrecognizable images but predicted to be due to insufficient content.