Low-quality images are an inevitable, intermittent reality for many real-world computer vision applications. At one extreme, they can be life-threatening, such as when they impede the ability of autonomous vehicles and traffic controllers to safely navigate environments. In other cases, they can serve as irritants when they convey a negative impression to the viewing audience, such as on social media or dating websites.
Despite the fact that low-quality images often emerge in practical settings, there has largely been a disconnect between research aimed at recognizing quality issues and research aimed at performing downstream vision tasks. For researchers focused on uncovering what quality issues are observed in an image, progress has largely grown from artificially-constructed settings, where algorithms are trained and evaluated on publicly-available datasets constructed by distorting high-quality images to simulate quality issues (e.g., using JPEG compression or Gaussian blur) [41, 47, 12, 21, 37, 36, 25, 31]. Yet, these contrived environments typically lack sufficient sophistication to capture the plethora of factors that contribute to quality issues in natural settings (e.g., camera hardware, lighting, camera shake, scene obstructions). Moreover, the quality issues are decoupled from whether they relate to the ability to complete specific vision tasks. As for researchers focusing on specific tasks, much of their progress has developed in environments that lack low-quality images. That is because the creators of popular publicly-available datasets that support the development of such algorithms typically included a step to filter out any candidate images deemed of insufficient quality for the final dataset [11, 14, 23, 9, 53, 28, 59]. Consequently, such datasets lack data that would enable training algorithms to identify when images are of insufficient quality to complete a given task.
Motivated by the aim to tie the assessment of image quality to practical vision tasks, we introduce a new image quality assessment (IQA) dataset that emerges from a real use case. Specifically, our dataset is built around 39,181 images that were taken by people who are blind who were authentically trying to learn about images they took using the VizWiz mobile phone application. Of these images, 17% were submitted to collect image captions from remote humans. The remaining 83% were submitted with a question to collect answers to visual questions. As discussed in prior work [7, 17], users submitted these images and visual questions (i.e., images with questions) to overcome real visual challenges that they faced in their daily lives. They typically waited nearly two minutes to receive a response from the remote humans. For each image, we asked crowdworkers to either supply a caption describing it or clarify that the quality issues are too severe for them to be able to create a caption. We call this task the unrecognizability classification task. We also asked crowdworkers to label each image with quality flaws that are more traditionally discussed in the literature [7, 12]: blur, overexposure (bright), underexposure (dark), improper framing, obstructions, and rotated views. We call this task the quality flaws classification task. Examples of resulting labeled images in our dataset are shown in Figure 1. Altogether, we call this dataset VizWiz-QualityIssues.
We then demonstrate the value of this new dataset for several new purposes. First, we introduce a novel problem and algorithms for predicting whether an image is sufficient quality to be captioned (Section 4). This can be of immediate use to blind photographers, who otherwise must wait nearly two minutes to learn their image is unsuitable quality for image captioning. We next conduct experiments to demonstrate an additional benefit of this prediction system for creating large-scale image captioning datasets with less wasted human effort (Section 4.3). Finally, we introduce a novel problem and algorithms that inform a user who submits a novel visual question whether it can be answered, cannot be answered because the image content is unrecognizable, or cannot be answered because the image content is missing from the image (Section 5). This too can be of immediate benefit to blind photographers by enabling them to both fail fast and gain valuable insight into how to update the visual question to make it become answerable.
More generally, our work underscores the importance of defining quality within the context of specific tasks. We expect our work can generalize to related vision tasks such as object recognition, scene classification, and video analysis.
2 Related Work
Image Quality Datasets.
A number of image quality datasets exist to support the development of image quality assessment (IQA) algorithms, including LIVE [41, 47], LIVE MD, TID2008, TID2013, CSIQ, Waterloo Exploration, and ESPL-LIVE. A commonality across most such datasets is that they originate from high-quality images that were artificially distorted to introduce image quality issues. For example, LIVE consists of 779 distorted images, derived by applying five different types of distortions at numerous distortion levels to 29 high-quality images. Yet, image quality issues that arise in real-world settings exhibit distinct appearances from those found by simulating distortions to high-quality images. Accordingly, our work complements recent efforts to create large-scale datasets that flag quality issues in natural images. However, our dataset is considerably larger, offering approximately a 19-fold increase in the number of naturally distorted images; i.e., 20,244 images in our dataset versus 1,162 images in that prior work. In addition, while that work assigns a single quality score to each image to capture any of a wide array of image quality issues, our work instead focuses on recognizing the presence of each distinct quality issue and assessing the impact of the quality issues on the real application needs of real users.
Image Quality Assessment.
Our work also relates to the literature that introduces methods for assessing the quality of images. One body of work assumes that developers have access to a high-quality version of each novel image, whether partially or completely. For example, distorted images are evaluated against original, intact images for full-reference IQA algorithms [47, 49, 57, 41, 25, 6, 39] and distorted images are evaluated against partial information about the original, intact images for reduced-reference IQA algorithms [50, 26, 48, 42, 38, 32, 51]. Since our natural setting inherently limits us from having access to original, intact images, our work instead aligns with the second body of work, which is built around the assumption that no original, reference image is available; i.e., no-reference IQA (NR-IQA). NR-IQA algorithms instead predict a quality score for each novel image [33, 22, 48, 56, 55, 29, 43, 6, 44]. While many algorithms have been introduced for this purpose, our analysis of five popular NR-IQA models (i.e., BRISQUE, NIQE, CNN-NRIQA, DNN-NRIQA, and NIMA) demonstrates that they are inadequate for our novel task of assessing which images are unrecognizable and so cannot be captioned (discussed in Section 4). Accordingly, we introduce new algorithms for this purpose, and demonstrate their advantage.
Efficient Creation of Large-Scale Vision Datasets.
Progress in the vision community has largely been measured and accelerated by the creation of large-scale vision datasets over the past 20 years. Typically, researchers have scraped images for such datasets from online image search databases [11, 14, 23, 9, 53, 28, 59]. In doing so, they typically curate a large collection of high-quality images, since such images first passed uploaders’ assessment that they are of sufficient quality to be shared publicly. In contrast, when employing images captured “in the wild,” it can be a costly, time-consuming process to identify and remove images with unrecognizable content. Accordingly, we quantify the cost of this problem, introduce a novel problem and algorithms for deciphering when image content would be unrecognizable to a human and so should be discarded, and demonstrate the benefit of such solutions for more efficiently creating a large-scale image captioning dataset.
Assistive Technology for Blind Photographers.
Our work relates to the literature about technology for assisting people who are blind to take high-quality pictures [1, 5, 20, 45, 58]. Already, existing solutions can assist photographers in improving the image focus, lighting, and composition [20, 45, 58]. Additionally, algorithms can inform photographers whether their questions about their images can be answered and why crowds struggle to provide answers [4, 15]. Complementing prior work, we introduce a suite of new AI problems and solutions for offering more fine-grained guidance when alerting blind photographers about what image quality issue(s) are observed. Specifically, we introduce novel problems of (1) recognizing whether image content can be recognized (and so captioned) and (2) deciphering when a question about an image can be answered, cannot be answered because the image content is unrecognizable, or cannot be answered because the content of interest is missing from the image.
We now describe our creation of a large-scale, human-labeled dataset to support the development of algorithms that can assess the quality of images. We focus on a real use case that is prone to image quality issues. Specifically, we build on 39,181 publicly-available images [16, 17] that originate from blind photographers who each submitted an image with, optionally, a question to the VizWiz mobile phone application in order to receive descriptions of the image from remote humans. Since blind photographers are unable to verify the quality of the images they take, the dataset exemplifies the large diversity of quality issues that occur naturally in practice. We describe below how we create and analyze our new dataset.
3.1 Creation of the Dataset
We scoped our dataset around quality issues that impede people who are blind in their daily lives. Specifically, a clear, resounding message is that people who are blind need assistance in taking images that are sufficiently high-quality that sighted people are able to either describe them or answer questions about them [5, 7].
Quality Issues Taxonomy.
One quality issue label we assess is whether image content is sufficiently recognizable for sighted people to caption the images. We also label numerous quality flaws to situate our work in relation to other papers that similarly focus on assessing image quality issues [7, 12]. Specifically, we include the following categories: blur (is the image blurry?), bright (is the image too bright?), dark (is the image too dark?), obstruction (is the scene obscured by the photographer’s finger over the lens, or another unintended object?), framing (are parts of necessary items missing from the image?), rotation (does the image need to be rotated for proper viewing?), other, and no issues (there are no quality issues in the image).
Image Labeling Task.
To efficiently label all images, we designed our task to run on the crowdsourcing platform Amazon Mechanical Turk. The task interface showed an image on the left half and the instructions with user-entry fields on the right half. First, the crowdworker was instructed to either describe the image in one sentence or click a button to flag the image as being of insufficient quality to recognize the content (and so not captionable). When the button was clicked, the image description was automatically populated with the following text: “Quality issues are too severe to recognize the visual content.” Next, the crowdworker was instructed to select all image quality flaws observed from a pre-defined list. Shown were the six reasons identified above, as well as Other (OTH), linked to a free-entry text box so other flaws could be described, and None (NON), so crowdworkers could specify the image had no quality flaws. The interface enabled workers to adjust their view of the image, using the toolbar to zoom in, zoom out, pan around, or rotate the image if needed. To encourage higher quality results, the interface prevented a user from completing the task until a complete sentence was provided and at least one option from the “image quality flaw” options was chosen. A screenshot of the user interface is shown in the Supplementary Materials.
To support the collection of high quality labels, we only accepted crowdworkers who previously had completed over 500 HITs with at least a 95% acceptance rate. Also, we collected redundant results. Specifically, we recruited five crowdworkers to label each image. We deemed a label as valid only if at least two crowdworkers chose that label.
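The two-of-five validity rule above can be sketched as a simple vote tally. This is a hypothetical helper; the function and label names are ours for illustration, not the dataset's schema:

```python
from collections import Counter

def valid_labels(worker_labels, min_votes=2):
    """Keep a label only if at least `min_votes` workers selected it.

    `worker_labels` holds one list of quality-flaw labels per crowdworker.
    """
    counts = Counter(label for labels in worker_labels for label in set(labels))
    return {label for label, n in counts.items() if n >= min_votes}

# Five workers label one image; BLR and FRM each get two votes, so both
# survive, while the single-vote labels are discarded.
agreed = valid_labels([["BLR", "FRM"], ["BLR"], ["FRM"], ["ROT"], ["NON"]])
```

Deduplicating each worker's list with `set(labels)` ensures one vote per worker per label.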
3.2 Characterization of the Dataset
Prevalence of Quality Issues.
We first examine the frequency at which images taken by people who are blind suffer from the various quality issues to identify the (un)common reasons. To do so, we tally how often unrecognizable images and each quality flaw arise.
Roughly half of the images suffer from image quality flaws. We observe that the most common reasons are image blur and inadequate framing. In contrast, only a small portion of the images are labeled as too bright, too dark, having objects obscuring the scene, needing to be rotated for successful viewing, or flawed for other reasons. These statistics reveal the most promising directions for improving assistive photography tools and, in turn, blind users’ experiences. Specifically, the main functions should focus on camera shake detection and object detection to mitigate the possibility of taking images with blur or framing flaws.
We also observe that the image quality issues are so severe that image content is deemed unrecognizable for 14.8% of the images. In absolute terms, this means that $3,829 and 379 hours of human annotation were wasted employing crowdworkers to caption images that contained unrecognizable content (crowdworkers were paid $0.132 for each image and spent an average of 47 seconds captioning each image). In other words, great savings can be achieved by automatically filtering out such uncaptionable images so that they are not sent to crowdworkers. We explore this idea further in Section 4.3.
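As a sanity check, the wasted-cost figures follow from the statistics reported above (the 14.8% rate, five workers per image, $0.132 pay, and 47-second average). Small differences from the paper's exact totals are expected, since the 14.8% rate is itself rounded:

```python
total_images = 39181
unrecognizable_rate = 0.148   # fraction of images deemed unrecognizable
workers_per_image = 5         # redundant labels collected per image
pay_per_caption = 0.132       # USD paid per caption
seconds_per_caption = 47      # average time spent per caption

unrecognizable = round(total_images * unrecognizable_rate)
wasted_dollars = unrecognizable * workers_per_image * pay_per_caption
wasted_hours = unrecognizable * workers_per_image * seconds_per_caption / 3600
# Lands within a few dollars/hours of the reported $3,829 and 379 hours.
```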
Likelihood Image Has Unrecognizable Content Given its Quality-Flaw.
We next examine the probability that an image’s content is unrecognizable conditioned on each of the reasons for quality flaws. Results are shown in Figure 2.
Almost all reasons led to percentages larger than the overall percentage of unrecognizable images (14.8% of all images). This demonstrates what we intuitively suspected: images with quality flaws are more likely to have unrecognizable content. We observe that this trend is strongest for images that suffer from obstructions (OBS) and inadequate lighting (BRT and DRK), with percentages just over 40%.
Interestingly, two categories have percentages that are smaller than the overall percentage of unrecognizable images (14.8% of all images). First, images that are flagged as needing to be rotated for proper viewing (ROT) have only a small fraction deemed unrecognizable. In retrospect, this seems understandable, as the content of images with a rotation flaw can still be recognized if viewers tilt their heads (or apply visual display tools to rotate the images). Second, images labeled with no flaws (NON) have only a tiny fraction deemed unrecognizable. This aligns with the concept that “unrecognizable” and “no flaws” are two conflicting ideas. Still, the fact that the percentage is not 0% highlights that humans can offer different perspectives. Put differently, the image quality assessment task can be subjective.
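The conditional percentages above can be estimated directly from the per-image labels by simple counting. A minimal sketch, using illustrative toy records rather than the dataset's actual schema:

```python
def p_unrecognizable_given_flaw(images, flaw):
    """Estimate P(unrecognizable | flaw) by counting over labeled images."""
    with_flaw = [im for im in images if flaw in im["flaws"]]
    if not with_flaw:
        return 0.0
    return sum(im["unrecognizable"] for im in with_flaw) / len(with_flaw)

# Toy data: two blurry images, one of which is unrecognizable.
images = [
    {"flaws": {"BLR"}, "unrecognizable": True},
    {"flaws": {"BLR", "FRM"}, "unrecognizable": False},
    {"flaws": {"ROT"}, "unrecognizable": False},
    {"flaws": set(), "unrecognizable": False},
]
```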
Likelihood Image Has Each Quality-Flaw Given its Content is Unrecognizable.
We next examine the probability that an image manifests each quality flaw given that its content is unrecognizable. Results are shown in Figure 2. Overall, our findings parallel those identified in the “Prevalence of Quality Issues” paragraph. For example, we again observe that the most common reasons are blurry images and improper framing (71.2%). Similarly, unrecognizable images are found to be associated less frequently with the other quality flaws.
Relationship Between Quality Flaws in Images.
Finally, we quantify the relationship between all possible pairs of quality flaws. In doing so, we were motivated to provide a measure that offers insight into causality and co-occurrence when comparing any pair of quality flaws, while avoiding measuring joint probabilities. To meet this aim, we introduce a new measure, which we call the interrelation index.
More details about this measure and the motivation for it are provided in the Supplementary Materials. Briefly, larger positive values indicate that two flaws tend to co-occur, with one causing the other to happen more often. Results are shown in Figure 3.
We observe that almost all quality flaws tend to occur with one another, as shown by the positive values of the index. At first, we were surprised to observe a relationship between BRT and DRK (i.e., their interrelation index is greater than zero), since these flaws are seemingly incompatible concepts. However, from visual inspection of the data, we found some images indeed suffered from both lighting flaws. We exemplify this and other quality flaw correlations in the Supplementary Materials. From our findings, we also observe that “no flaws” does not co-occur with other quality flaws; i.e., the values in the grid are all negative for the row and column for NON. This finding aligns with our intuition that an image labeled with NON is unlikely to also have a quality flaw.
4 Classifying Unrecognizable Images
A widespread assumption when captioning images is that the image quality is good enough to recognize the image content. Yet, people who are blind cannot verify the quality of the images they take, and it is known that their images can be very poor in quality [5, 7, 17]. Accordingly, we now examine the benefit of our large-scale quality dataset for training algorithms to detect when images are unrecognizable and so not captionable.
4.1 Motivation: Inadequate Existing Methods
Before exploring novel algorithms, it is important to first check whether existing methods are suitable for our purposes. Accordingly, we check whether related NR-IQA systems can detect when images are unrecognizable. To do so, we apply five NR-IQA methods to the complete VizWiz-QualityIssues dataset: BRISQUE, NIQE, CNN-NRIQA, DNN-NRIQA, and NIMA. The first two are popular conventional methods that rely on hand-crafted features. The last three are based on neural networks and trained on the IQA datasets mentioned in Section 2. For example, DNN-NRIQA-TID and DNN-NRIQA-LIVE in Figure 4 are trained on the TID and LIVE datasets, respectively. Intuitively, if the algorithms were effective for this task, we would expect the scores for recognizable images to be distributed mostly in the high-score region and the scores for unrecognizable images mostly in the low-score region.
Results are shown in Figure 4. A key finding is that the distributions of scores for recognizable and unrecognizable images heavily overlap. That is, none of the methods can distinguish recognizable images from unrecognizable images in our dataset. This finding shows that existing methods trained on existing datasets (i.e., LIVE, TID, CSIQ) are unsuitable for our novel task on the VizWiz-QualityIssues dataset. This is possibly in part because quality issues resulting from artificial distortions, such as compression, Gaussian blur, and additive Gaussian noise, differ from natural distortions triggered by poor camera focus, lighting, framing, etc. It may also be because there is no one-to-one mapping between scores indicating overall image quality and our proposed task, since an image with a low quality score may still have recognizable content.
4.2 Proposed Algorithm
Having observed that existing IQA methods are inadequate for our problem, we now introduce models for our novel task of assessing whether an image is recognizable.
We use ResNet-152 pretrained on ImageNet [9], followed by two fully connected layers, and only learn the weights in those two fully connected layers. (Due to space constraints, we demonstrate the effectiveness of this architecture for assessing the quality flaws in the Supplementary Materials. The primary difference for that architecture is that we replace ResNet-152 with XceptionNet, use three fully connected layers, and end with a final layer of eight neurons with eight sigmoid functions.)
For training and evaluation of our algorithm, we apply a 52.5%/37.5%/10% split to our dataset to create the training, validation, and test splits.
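One way to realize such a split is to shuffle once with a fixed seed and slice by the given fractions. This is a sketch of the general technique; the paper does not specify its exact splitting procedure:

```python
import random

def split_dataset(items, fractions=(0.525, 0.375, 0.10), seed=0):
    """Shuffle deterministically and slice into train/val/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(fractions[0] * len(items))
    n_val = int(fractions[1] * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(1000))
```

Putting the remainder in the test slice guarantees every item lands in exactly one split even when the fractions do not divide the dataset evenly.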
[Table 1 excerpt: HOG + linear SVM — 56.4 / 41.2 / 47.6 (average precision / recall / F1).]
[Table 2 excerpt, models trained on the full training set: AoANet — 63.3 / 44.3 / 29.9 / 19.7 / 18.0 / 44.4 / 43.6 / 11.2; SGAE — 62.8 / 43.3 / 28.6 / 18.8 / 17.3 / 44.0 / 32.4 / 10.4 (the eight metrics listed in Section 4.3).]
We compare our algorithm to numerous baselines. One is random guessing, in which an image is flagged as unrecognizable with a fixed probability. We also analyze a linear SVM that predicts from scale-invariant feature transform (SIFT) features; intuitively, a low-quality image should have few or no key points. Finally, we evaluate a linear SVM that predicts from histogram of oriented gradients (HOG) features.
We evaluate each method using average precision, recall, and F1 scores. Accuracy is excluded because the distribution of unrecognizability is highly skewed toward “false,” and such unbalanced data suffers from the accuracy paradox.
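The accuracy paradox this metric choice guards against is easy to reproduce: with roughly 15% positive (unrecognizable) labels, a degenerate classifier that never flags an image looks accurate while being useless. A self-contained illustration:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum(p and not t for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [True] * 15 + [False] * 85   # ~15% unrecognizable, as in our data
always_negative = [False] * 100       # classifier that never flags an image
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
# accuracy is 0.85 even though recall (and hence F1) is 0.
```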
Results are shown in Table 1. We observe that both SIFT and HOG are much stronger baselines than random guessing and achieve high precision scores, especially SIFT. However, both achieve low recall. This means that SIFT and HOG are good at capturing a subset of unrecognizable images but still miss many others. On the other hand, the ResNet model achieves much higher recall while maintaining decent average precision, implying that it is more effective at learning the characteristics of unrecognizable images. (Again, due to space constraints, results showing prediction performance for quality flaw classification are in the Supplementary Materials.) This is exciting since such an algorithm can be of immediate use to blind photographers, who otherwise must wait nearly two minutes to learn their image is of unsuitable quality for image captioning.
4.3 Application: Efficient Dataset Creation
We now examine another potential benefit of our algorithm: helping to create a large-scale training dataset.
To support this effort, we divide the dataset into three sets. One set is used to train our image unrecognizability algorithm. A second set is used to train our image captioning algorithms, which we call the captioning-training-set. The third set is used to evaluate our image captioning algorithms, which we call the captioning-evaluation-set.
We use our method to identify which images in the captioning-training-set to use for training image captioning algorithms. In particular, the images flagged as recognizable are included and the remaining images are excluded. We compare this method to three baselines, specifically training on: all images in the captioning-training-set, a random sample of images in the captioning-training-set, a perfect sample of images in the captioning-training-set that are known to be recognizable images.
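The filtering step reduces to thresholding the classifier's recognizability scores. A minimal sketch; the 0.5 cut-off is our assumption for illustration, not a value from the paper:

```python
def keep_recognizable(images, scores, threshold=0.5):
    """Retain images whose predicted probability of being recognizable
    meets the threshold; the rest are excluded from captioning training."""
    return [im for im, s in zip(images, scores) if s >= threshold]

kept = keep_recognizable(["img_a", "img_b", "img_c"], [0.9, 0.2, 0.6])
```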
We evaluate two state-of-the-art image captioning algorithms, trained independently on each training set, with respect to eight evaluation metrics: BLEU-1–4, METEOR, ROUGE-L, CIDEr-D, and SPICE.
Results are shown in Table 2. Our method performs comparably to when the algorithms were trained on all images as well as the perfect set. In contrast, our method yields improved results over the random sample. Altogether, these findings offer promising evidence that our prediction system is successfully retaining meaningful images while removing images that are not informative for the captioning task (i.e., unrecognizable). This reveals that a benefit of using the recognizability prediction system is to save time and money when crowdsourcing captions (by first removing unrecognizable images), without diminishing the performance of downstream trained image captioning algorithms.
5 Recognizing Unanswerable Visual Questions
The visual question “answerability” problem is to decide whether a visual question can be answered. Yet, as exemplified in Figure 5, visual questions can be unanswerable because the image is unrecognizable or because the answer to the question is missing from a recognizable image. Towards enabling more fine-grained guidance to photographers regarding how to modify a visual question so it is answerable, we move beyond predicting whether a visual question is unanswerable and introduce a novel problem of predicting why a visual question is unanswerable.
We extend the VizWiz-VQA dataset, which labels each image–question pair as answerable or unanswerable. We inspect how answerability relates to recognizability and each quality flaw. For convenience, we use the following notation: A: answerable, ¬A: unanswerable, R: recognizable, ¬R: unrecognizable, Q: quality issues, and P: probability function. Results are shown in Figure 6. We observe that for most quality flaws Q, P(¬R | Q) is larger than P(¬R), and it increases further when the question is also known to be unanswerable. Additionally, the probability of unrecognizability increases from P(¬R) to P(¬R | ¬A) when questions are known to be unanswerable. Observing that a large reason for questions being unanswerable is that images are unrecognizable, we are motivated to equip VQA systems with a function that can clarify why questions are unanswerable.
5.2 Proposed Algorithm
Our algorithm extends the Up-Down VQA model. It takes as input encoded image features and a paired question. Image features can be grid-level features extracted by ResNet-152 or object-level features extracted by Faster-RCNN or Detectron [13, 52]. The input question is first encoded by a GRU cell. Then, a top-down attention module computes a weighted image feature from the encoded question representation and the input image features. The image and question features are coupled by element-wise multiplication. This coupled feature is processed by the prediction module to predict answerability and recognizability. We employ two different activation functions at the end of the model to make the final prediction. The first is a softmax, which predicts three exclusive classes: answerable, unrecognizable, and insufficient content information (the answer cannot be found in the image). The second is two independent sigmoids, one for answerability and the other for recognizability. We train the network using the Adam optimizer with a learning rate of 0.001, updating only the layers after feature extraction.
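The two prediction heads differ only in their final activation: softmax forces the three outcomes to compete for probability mass, while independent sigmoids let answerability and recognizability vary freely. A small sketch of that distinction (the logit values are arbitrary illustrations):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Head 1: three mutually exclusive classes
# (answerable / unrecognizable / insufficient content); probabilities sum to 1.
class_probs = softmax([2.0, 0.5, -1.0])

# Head 2: two independent binary outputs that need not sum to 1.
p_answerable = sigmoid(1.2)
p_recognizable = sigmoid(-0.3)
```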
We split the VizWiz dataset into training/validation/test sets according to a 70%/20%/10% ratio.
We evaluate performance using average precision, precision, recall, and F1 scores, for which a simple threshold is used to binarize probability values. For inter-model comparisons, we also report the precision–recall curve for each variant.
For comparison, we consider a number of baselines. One is the original model for predicting whether a visual question is answerable, which also employs a top-down attention model. We also evaluate the random guessing, SIFT, and HOG baselines used to evaluate the recognizability algorithms in the previous section.
[Table 3 excerpt: sigm w/o att. — 67.7 / 66.1 / 64.2 (Unans) and 86.7 / 66.7 / 74.2 (Unrec given unans).]
TD: top-down attention. BU: bottom-up attention. soft: softmax. sigm: sigmoid. att: attention. AP: average precision. Rec: recall. Unrec: unrecognizable. Unans: unanswerable.
†Precision is calculated, since true or false is predicted instead of a probability.
Results are shown in Table 3 and Figure 7. Our models perform comparably to the answerability baseline. This is exciting because it shows that jointly learning to predict answerability with recognizability does not degrade performance; i.e., the average precision scores from the TD+softmax and TD+sigmoid models are better than that of the baseline, as are the F1 scores.
Our results also highlight the importance of jointly learning to predict answerability with recognizability (i.e., rows 5–9) over relying on more basic baselines (i.e., rows 2–4). As shown in Table 3, low recall values imply that SIFT and HOG fail to capture many unrecognizable images, while our models learn image features and excel in recall and F1 scores.
Next, we compare the results from TD+softmax and TD+sigmoid. We observe they are comparable in unanswerability prediction, with comparable average precision and F1 scores. For unrecognizability prediction, TD+softmax is a bit weaker than TD+sigmoid, due to slightly lower average precision and F1 scores. One reason for this may be the manual assignment of unrecognizability to false when answerability is true; this assignment shrinks the portion of images labeled unrecognizable. Learning from more highly skewed data is a harder task, which could in part explain the weaker performance of the TD+softmax model.
We introduce a new image quality assessment dataset that emerges from an authentic use case where people who are blind struggle to capture high-quality images towards learning about their visual surroundings. We demonstrate the potential of this dataset to encourage the development of new algorithms that can support real users trying to obtain image captions and answers to their visual questions. The dataset and all code are publicly available at https://vizwiz.org.
We gratefully acknowledge funding support from the National Science Foundation (IIS-1755593), Microsoft, and Amazon. We thank Nilavra Bhattacharya and the crowdworkers for their valuable contributions to creating the new dataset.
-  . Note: http://www.taptapseeapp.com/ Cited by: §2.
-  (2016) Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pp. 382–398. Cited by: §4.3.
-  (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086. Cited by: §5.2.
-  (2019) Why does a visual question have different answers?. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4271–4280. Cited by: §2.
-  (2010) VizWiz: nearly real-time answers to visual questions. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, pp. 333–342. Cited by: §1, §2, §3.1, §3, §4.
-  (2017) Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing 27 (1), pp. 206–219. Cited by: §2, Figure 4, §4.1.
-  (2013) Visual challenges in the everyday lives of blind people. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2117–2126. Cited by: §1, §3.1, §3.1, §4.
-  (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258. Cited by: footnote 2.
-  (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §2, §4.2.
-  (2014) Meteor universal: language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, Cited by: §4.3.
-  (2004) Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pp. 178–178. Cited by: §1, §2.
-  (2015) Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing 25 (1), pp. 372–387. Cited by: §1, §1, §2, §3.1.
-  (2018) Detectron. Note: https://github.com/facebookresearch/detectron Cited by: §5.2.
-  (2007) Caltech-256 object category dataset. Cited by: §1, §2.
-  (2017) CrowdVerge: predicting if people will agree on the answer to a visual question. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 3511–3522. Cited by: §2.
-  (2019) VizWiz-priv: a dataset for recognizing the presence and purpose of private visual information in images taken by blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 939–948. Cited by: §3.
-  (2018) Vizwiz grand challenge: answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3608–3617. Cited by: §1, §2, §3, §4, §5.1, §5.2, §5.2, Table 3, §5.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.2, §5.2.
-  (2019) Attention on attention for image captioning. In International Conference on Computer Vision, Cited by: Table 2.
-  (2011) Supporting blind photography. In The proceedings of the 13th international ACM SIGACCESS conference on Computers and accessibility, pp. 203–210. Cited by: §2.
-  (2012) Objective quality assessment of multiply distorted images. In 2012 Conference record of the forty sixth asilomar conference on signals, systems and computers (ASILOMAR), pp. 1693–1697. Cited by: §1, §2.
-  (2014) Convolutional neural networks for no-reference image quality assessment. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1733–1740. Cited by: §2, Figure 4, §4.1.
-  (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §1, §2.
-  (2017) Large-scale crowdsourced study for high dynamic range images. IEEE Trans. Image Process. 26 (10), pp. 4725–4740. Cited by: §2.
-  (2010) Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging 19 (1), pp. 011006. Cited by: §1, §2, §2.
-  (2009) Reduced-reference image quality assessment using divisive normalization-based image representation. IEEE journal of selected topics in signal processing 3 (2), pp. 202–211. Cited by: §2.
-  (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §4.3.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1, §2.
-  (2014) No-reference image quality assessment based on spatial and spectral entropies. Signal Processing: Image Communication 29 (8), pp. 856–863. Cited by: §2.
-  (2019) Veri-wild: a large dataset and a new method for vehicle re-identification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3235–3243. Cited by: §1.
-  (2016) Waterloo exploration database: new challenges for image quality assessment models. IEEE Transactions on Image Processing 26 (2), pp. 1004–1016. Cited by: §1, §2.
-  (2011) Reduced-reference image quality assessment using reorganized dct-based image representation. IEEE Transactions on Multimedia 13 (4), pp. 824–829. Cited by: §2.
-  (2012) No-reference image quality assessment in the spatial domain. IEEE Transactions on image processing 21 (12), pp. 4695–4708. Cited by: §2, Figure 4, §4.1.
-  (2012) Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters 20 (3), pp. 209–212. Cited by: §2, Figure 4, §4.1.
-  (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Cited by: §4.3.
-  (2015) Image database tid2013: peculiarities, results and perspectives. Signal Processing: Image Communication 30, pp. 57–77. Cited by: §1, §2.
-  (2009) TID2008-a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics 10 (4), pp. 30–45. Cited by: §1, §2.
Reduced-reference image quality assessment by structural similarity estimation. IEEE Transactions on Image Processing 21 (8), pp. 3378–3389. Cited by: §2.
-  (2018) A haar wavelet-based perceptual similarity index for image quality assessment. Signal Processing: Image Communication 61, pp. 33–43. Cited by: §2.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §5.2.
-  (2006) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing 15 (11), pp. 3440–3451. Cited by: §1, §2, §2.
-  (2011) RRED indices: reduced reference entropic differencing for image quality assessment. IEEE Transactions on Image Processing 21 (2), pp. 517–526. Cited by: §2.
No-reference image quality assessment using modified extreme learning machine classifier. Applied Soft Computing 9 (2), pp. 541–552. Cited by: §2.
-  (2018) Nima: neural image assessment. IEEE Transactions on Image Processing 27 (8), pp. 3998–4011. Cited by: §2, Figure 4, §4.1.
-  (2014) An assisted photography framework to help visually impaired users properly aim a camera. ACM Transactions on Computer-Human Interaction (TOCHI) 21 (5), pp. 25. Cited by: §2.
-  (2015) Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575. Cited by: §4.3.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13 (4), pp. 600–612. Cited by: §1, §2, §2.
-  (2011) Reduced-and no-reference image quality assessment. IEEE Signal Processing Magazine 28 (6), pp. 29–40. Cited by: §2.
-  (2003) Multiscale structural similarity for image quality assessment. In The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, Vol. 2, pp. 1398–1402. Cited by: §2.
-  (2005) Reduced-reference image quality assessment using a wavelet-domain natural image statistic model. In Human Vision and Electronic Imaging X, Vol. 5666, pp. 149–159. Cited by: §2.
-  (2013) Reduced-reference image quality assessment with visual information fidelity. IEEE Transactions on Multimedia 15 (7), pp. 1700–1705. Cited by: §2.
-  (2019) Detectron2. Note: https://github.com/facebookresearch/detectron2 Cited by: §5.2.
-  (2010) Sun database: large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492. Cited by: §1, §2.
-  (2019) Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10685–10694. Cited by: Table 2.
-  (2012) No-reference image quality assessment using visual codebooks. IEEE Transactions on Image Processing 21 (7), pp. 3129–3138. Cited by: §2.
-  (2012) Unsupervised feature learning framework for no-reference image quality assessment. In 2012 IEEE conference on computer vision and pattern recognition, pp. 1098–1105. Cited by: §2.
-  (2011) FSIM: a feature similarity index for image quality assessment. IEEE transactions on Image Processing 20 (8), pp. 2378–2386. Cited by: §2.
-  (2013) Real time object scanning using a mobile phone and cloud-based visual search engine. In Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 20. Cited by: §2.
-  (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2.
-  (2016) Traffic-sign detection and classification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2110–2118. Cited by: §1.
This document supplements Sections 3, 4, and 5 of the main paper. In particular, it includes the following:
- Details and motivation for the quality flaw interrelation index (supplements Section 3.2).
- Results of quality flaw prediction (supplements Section 4.2).
- Figures illustrating the crowdsourcing interface used to curate our labels (supplements Section 3.1), the diversity of the resulting unrecognizable images (supplements Section 3.2), the performance of our prediction system in classifying unrecognizable images (supplements Section 4.2), and the performance of predicting the reason for unanswerable questions (supplements Section 5.2).
- Clarification about the baselines used in Section 4.3.
7 Quality flaw interrelation index
Details and motivation
The most straightforward way to explore the relation between two quality flaws $A$ and $B$ is to look at their co-occurrence, i.e., their joint probability $P(A,B)$. However, $P(A,B)$ cannot really capture the interrelation between quality flaws. For instance, we cannot say that the relation between DRK and FRM is stronger than the one between DRK and OBS simply because $P(\mathrm{DRK},\mathrm{FRM}) > P(\mathrm{DRK},\mathrm{OBS})$. That inequality is driven by the marginals (i.e., $P(\mathrm{FRM}) > P(\mathrm{OBS})$) and has nothing to do with the interrelation of the quality flaws.
Consequently, we introduce a new measure, which we call the interrelation index $i_{A\to B}$, defined as follows:

$$i_{A\to B} = P(B \mid A) - P(B \mid \bar{A}).$$
There are several advantages of this measure:
- It measures causality from $A$ to $B$: since $P(B)$ is a convex combination of $P(B \mid A)$ and $P(B \mid \bar{A})$, if $P(A)$ and $P(\bar{A})$ are both greater than zero, either $P(B \mid \bar{A}) \le P(B) \le P(B \mid A)$ or $P(B \mid A) \le P(B) \le P(B \mid \bar{A})$ holds. Therefore, if $i_{A\to B} > 0$, then the existence of $A$ must trigger $B$ to happen more (i.e., $P(B \mid A) > P(B)$) and the inexistence of $A$ must make $B$ happen less (i.e., $P(B \mid \bar{A}) < P(B)$), and vice versa.
- It measures co-occurrence of $A$ and $B$: we can show that if $i_{A\to B} > 0$, then $i_{B\to A} > 0$ (and likewise for the $<$ sign). Hence, we have $\operatorname{sign}(i_{A\to B}) = \operatorname{sign}(i_{B\to A})$. In other words, if $A$ makes $B$ happen more often, then $B$ must make $A$ happen more as well, and vice versa.
- It avoids the aforementioned problem of using the joint probability. That is, if $P(A,B)$ is large, it is very likely that the marginals $P(A)$ and $P(B)$ are large. However, which of the values $P(B \mid A)$ and $P(B \mid \bar{A})$ is greater, and by how much, cannot be told from $P(A,B)$ alone.
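To make the index concrete, here is a minimal Python sketch, assuming the difference-of-conditionals form $i_{A\to B} = P(B \mid A) - P(B \mid \bar{A})$ estimated from parallel lists of binary flaw labels; the function name and toy labels are illustrative, not from our codebase:

```python
def interrelation_index(flags_a, flags_b):
    """Estimate i_{A->B} = P(B|A) - P(B|not A) from parallel
    lists of 0/1 flaw annotations, one entry per image."""
    assert len(flags_a) == len(flags_b)
    n_a = sum(flags_a)
    n_not_a = len(flags_a) - n_a
    # Both P(A) and P(not A) must be nonzero for the index to exist.
    assert 0 < n_a < len(flags_a)
    b_given_a = sum(b for a, b in zip(flags_a, flags_b) if a) / n_a
    b_given_not_a = sum(b for a, b in zip(flags_a, flags_b) if not a) / n_not_a
    return b_given_a - b_given_not_a

# Toy annotations where the two flaws co-occur more often than chance.
A = [1, 1, 1, 0, 0, 0, 0, 0]
B = [1, 1, 0, 0, 0, 0, 0, 1]
i_ab = interrelation_index(A, B)  # P(B|A)=2/3, P(B|~A)=1/5 -> positive
i_ba = interrelation_index(B, A)
assert (i_ab > 0) == (i_ba > 0)   # the directional indices share a sign
```

As the final assertion illustrates, the two directional indices always agree in sign, which is the co-occurrence property.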
Co-occurrence of DRK and BRT.
Since the interrelation indices $i_{\mathrm{DRK}\to\mathrm{BRT}}$ and $i_{\mathrm{BRT}\to\mathrm{DRK}}$ are both greater than zero, the quality flaws DRK and BRT tend to co-occur despite being contradictory concepts. The examples of such images in Figure 9 explain why this phenomenon happens: when blind people take pictures in places with poor lighting, they are often unaware that the flash on their mobile device has turned on automatically, and so the resulting pictures typically show dark surroundings with a bright spot. Note that this phenomenon is not captured by the joint probability of DRK and BRT, since $P(\mathrm{DRK},\mathrm{BRT})$ is extremely small and so does not stand out.
Co-occurrence of quality flaws.
We exemplify the co-occurrence of other pairs of quality flaws in Figure 10.
8 Quality flaw prediction
The performance of quality flaw classification is shown in Table 4. The Xception model outperforms the random-guessing baseline for every quality flaw with respect to precision, recall, and F1 score. Furthermore, Xception performs much better on the NON, BLR, and FRM flaws, which represent large portions of the dataset. On the other hand, quality flaws that represent small portions of the dataset suffer from a few-shot learning problem, and so learning to predict them is harder. In the extreme case of OTH, which represents only a tiny fraction of the data, the Xception model yields very poor recall and F1 scores.
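As a point of reference for Table 4, the per-flaw metrics and the expected scores of the random-guessing baseline can be computed with a short sketch; the helper names are ours, not from the paper's code. A fair-coin guesser on a flaw with prevalence $p$ has expected precision $p$ and expected recall $0.5$, so rare flaws cap the random baseline's F1 near zero:

```python
def prf1(y_true, y_pred):
    """Precision, recall, and F1 for one binary quality flaw."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def random_baseline(prevalence):
    """Expected scores for a fair-coin guesser on a flaw with the
    given prevalence: precision -> prevalence, recall -> 0.5."""
    p, r = prevalence, 0.5
    return p, r, 2 * p * r / (p + r)

# A hypothetical rare flaw at 2% prevalence caps the random F1 low:
_, _, f1_rare = random_baseline(0.02)  # f1_rare ~ 0.038
```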
9 Figures
Figure 8 illustrates the diversity of unrecognizable images that can arise from different quality flaws.
Figure 11 shows a screenshot of the crowdsourcing interface used to collect the labels for the dataset.
Figure 12 shows examples of unrecognizability predictions by the Xception model.
Figure 13 shows examples of predicting the reason a visual question is unanswerable; the prediction model used is the “TD+sigmoid” model.
10 Section 4.3: Clarification about Baselines
The two baselines, “random flag” and “perfect flag,” use the same number of images from the captioning training set as our method for algorithm training. That count is determined by our predictor: it is the number of images that remain after removing all images deemed unrecognizable. “Random flag” chooses a random sample of that size from the captioning training set. “Perfect flag” ranks images by how many crowdworkers flagged them as unrecognizable and removes them in that order, starting from those where all five crowdworkers agreed the image is unrecognizable.
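The selection logic of the two baselines can be sketched as follows, under the reading that “perfect flag” discards images starting from the most-flagged; the function names, image names, and flag counts are hypothetical:

```python
import random

def random_flag(train_images, keep_count, seed=0):
    """'Random flag': keep a uniformly random subset of the same size
    the unrecognizability predictor would keep."""
    rng = random.Random(seed)
    return rng.sample(train_images, keep_count)

def perfect_flag(train_images, flag_counts, keep_count):
    """'Perfect flag': discard images in order of how many of the five
    crowdworkers flagged them unrecognizable (most-flagged first),
    keeping the keep_count least-flagged images."""
    ranked = sorted(train_images, key=lambda im: flag_counts[im])
    return ranked[:keep_count]

images = ["a", "b", "c", "d"]
flags = {"a": 5, "b": 0, "c": 2, "d": 1}  # hypothetical flag counts
kept = perfect_flag(images, flags, 2)     # keeps the least-flagged pair
```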