GRIT: General Robust Image Task Benchmark

04/28/2022
by Tanmay Gupta, et al.

Computer vision models excel at making predictions when the test distribution closely resembles the training distribution. Such models have yet to match the ability of biological vision to learn from multiple sources and generalize to new data sources and tasks. To facilitate the development and evaluation of more general vision systems, we introduce the General Robust Image Task (GRIT) benchmark. GRIT evaluates the performance, robustness, and calibration of a vision system across a variety of image prediction tasks, concepts, and data sources. The seven tasks in GRIT are selected to cover a range of visual skills: object categorization, object localization, referring expression grounding, visual question answering, segmentation, human keypoint detection, and surface normal estimation. GRIT is carefully designed to enable the evaluation of robustness under image perturbations, image source distribution shift, and concept distribution shift. By providing a unified platform for thorough assessment of skills and concepts learned by a vision model, we hope GRIT catalyzes the development of performant and robust general purpose vision systems.


Code Repositories

grit_official: Official repository for the General Robust Image Task (GRIT) Benchmark

1 Introduction

How do we know if a person truly understands a visual concept like “sheep”? At the very least, we expect the person to be able to apply their conceptual understanding to a wide range of skills - identify something as a sheep, locate sheep, segment the pixels of a sheep, localize a referred-to sheep from a flock, answer simple questions about sheep, and so on. We also expect the individual to be equally capable of applying these skills on social media images, web cam footage, stock photos, and other sources. In spite of tremendous progress in computer vision, the flexibility and generality of vision systems still fall well short of this human ability. Our state-of-the-art classification systems cannot segment objects, and our best segmentation models cannot answer simple questions about the objects whose pixels they so adeptly identify. Furthermore, most vision models are trained and evaluated on a limited set of concepts and under a strong i.i.d. assumption, meaning that the images and annotations in training and test sets follow the same distribution. One of the barriers to developing more flexible and general computer vision systems is the lack of a standard methodology and benchmark for evaluating performance under distribution shifts across multiple tasks and a diverse set of concepts.

The General Robust Image Task Benchmark (GRIT) evaluates the performance and robustness of a vision system across a variety of image prediction tasks, concepts, and data sources. GRIT includes seven tasks: object categorization, object localization, referring expressions, visual question answering, semantic segmentation, human keypoint estimation, and surface normal estimation. These tasks are selected to cover a range of visual skills, and evaluation includes ability to make predictions for concepts that are learned from other data sources or tasks, robustness to image perturbations, and calibration measures.

Specifically, GRIT fulfills the following needs in development and evaluation of general robust vision systems:

  • GRIT is a unified platform for assessing the overall capability of computer vision systems in terms of 7 core competencies across a wide range of concepts and data sources - similar to GLUE [35] for natural language processing systems.

  • GRIT tests generalization to new data sources and concepts. In contrast, most existing benchmarks only evaluate in i.i.d. settings. In the GRIT Restricted track, the model is tested on data sources and concepts that are not allowed to be seen during training. To perform tasks with novel concepts the model must transfer skills across concepts [12].

  • GRIT tests robustness to image perturbations. Performance on each task is evaluated on a set of samples with and without 20 different types of image distortions of varying intensities such as JPEG compression, motion blur, Gaussian noise, and universal adversarial perturbation [15, 28].

  • GRIT simultaneously supports the development of large scale foundation models for vision and the fair comparison between models with limited compute resources. This is accomplished by GRIT’s two tracks: Restricted and Unrestricted. The Unrestricted track supports the development of large models with limited restrictions on allowed training data (any data source except those used to create the GRIT ablation and test sets). The Restricted track allows researchers to focus on skill-concept transfer and efficient learning given a rich but restricted set of training data sources. By limiting the training data to the selected publicly available sources, the Restricted track levels the playing field for researchers with respect to compute resource requirements and access to the same data sources.

2 Dataset Design Principles

All design decisions involved in creating GRIT are motivated by the following design principles:

  • Unambiguous Tasks. Following GLUE [35], we select vision and vision-language tasks with a clear task definition and unambiguous ground truth. We exclude captioning, for instance, as there may be many ways to caption an image. While we include the VQA task, which also has multiple possible answers, we follow the evaluation strategy used by the VQA benchmarks [2, 11] to eliminate ambiguity to the extent possible by including multiple answer options and answer text normalization. Additionally, we select questions with high answer consensus among annotators.

  • Generality and Robustness Tests: Each task includes evaluation samples that come from different data sources, include concepts that are not present in the training data for the task, and contain image perturbations. This measures transfer ability across data sources and concepts, and robustness to image distortions.

  • Concept Diversity and Balance: Task samples are selected to cover a wide and equally distributed range of concepts. Objects (noun concepts) have further been grouped into 24 concept groups (e.g. animals, food, tools).

  • Per-Sample Evaluation: All metrics are computed at a sample-level so they can be averaged across various subsets of data (e.g. samples from novel sources or those containing novel concepts) to summarize performance.

  • Knowledge Assessment and Calibration: Models are required to predict a confidence score for each prediction which is used to assess model’s knowledge, degree of misinformation, and calibration of beliefs.

  • Use Existing Datasets: When possible, we source tasks from existing, well established datasets to ensure that annotations and tasks are vetted. We also opt for hidden or even unused annotations from the selected sources. For example, we use COCO test-reserve annotations which are neither public nor used in any previous COCO or VQA challenges (VQA v2 [11] is based on COCO images).

  • Level Playing Field: A Restricted track with a fixed set of publicly available training data sources allows fair comparison across submissions and enables researchers with limited access to data and compute resources to participate and contribute novel, robust, and efficient learning methods.

  • Encourage Unified Models: We require all submissions to include the total parameter count of the models used. While participants are allowed to use completely separate models for different tasks, we encourage models that share parameters across tasks and thus have a lower parameter count. Parameter count also serves as a simple, albeit imperfect, measure of compute and sample efficiency.

3 Task Overview

Figure 2: Inputs and ground truth task outputs for each of the 7 tasks in GRIT. For the categorization task, instead of an input query, we provide a list of categories to choose from.
subset  task  images  samples (total / novel source / novel concept / distorted)  concepts (nouns / grouped nouns)
ablation categorization 12954 16839 12478 10740 385 745 727
localization 17457 21078 17193 15112 385 989 953
vqa 16017 21166 3565 713 385 9292 5488
refexp 4698 10525 2781 935 385 4899 3122
segmentation 10366 12745 8405 6977 385 695 680
keypoint 5385 5385 2529 0 385 1 1
normal 1786 1786 675 0 385 0 0
test categorization 13076 16841 12504 10755 385 766 752
localization 17380 21080 17289 15230 385 992 952
vqa 15988 21166 3624 684 385 9259 5441
refexp 4724 10526 2865 1008 385 5040 3194
segmentation 10420 12730 8385 6973 385 713 690
keypoint 5385 5385 2471 0 385 1 1
normal 1787 1787 679 0 385 0 0
Table 1: Number of images, samples, and concepts per task in GRIT.

For each task, the system should produce both an answer (text, boxes, segmentation masks, or surface normal maps) and a confidence (a score between 0 and 1) reflecting the model’s belief in the correctness of the answer. A correctness score (0 to 1) is calculated by comparing the model’s predicted answer to the ground truth annotation. Ideally, for a well calibrated model, the confidence would equal the correctness score. We now describe each task and the corresponding correctness score.

Object categorization identifies which label, from a given set, best corresponds to an image region defined by an input image and bounding box. Different from many benchmarks, such as ImageNet [9], the objects are typically depicted as part of a scene, and the bounding box indicates which object is of interest. A set of mutually exclusive categories representing possible answer options is provided as input to remove ambiguity. The correctness score is 1 if the prediction matches the ground truth label and 0 otherwise.

Object localization places a bounding box around each instance of a given object category in the input image (or none if no objects in the target class are present). To calculate the correctness score, predicted boxes are assigned to the ground truth boxes using a Hungarian matching algorithm. Each predicted box can be assigned to a single ground truth box or not assigned at all, and at most one predicted box is assigned to each ground truth box. The correctness score is defined as

$$\mathrm{score}_{loc} = \frac{\sum_{i=1}^{N_m} \mathrm{IoU}_i}{N_m + N_u} \qquad (1)$$

where IoU_i is the intersection over union of the i-th matched pair of ground truth and predicted boxes out of N_m matched pairs with a non-zero IoU, and N_u is the number of ground truth boxes that are not assigned to any prediction. Approximately 30% of the images contain zero target-class objects. If there are no ground truth boxes, the score is 1 if there are no predictions and 0 if there are any predictions.
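To make the matching and scoring concrete, below is a minimal Python sketch of Eq. 1 as reconstructed above, using Hungarian matching from SciPy. The function names `box_iou` and `localization_score` are illustrative; the official scorer in the grit_official repository may handle ties and edge cases differently.

```python
# Sketch of the localization correctness score (Eq. 1), assuming the
# reconstruction above; the official GRIT scorer may differ in details.
import numpy as np
from scipy.optimize import linear_sum_assignment

def box_iou(a, b):
    """IoU between two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def localization_score(pred_boxes, gt_boxes):
    # No ground truth: correct only if the model predicts nothing.
    if len(gt_boxes) == 0:
        return 1.0 if len(pred_boxes) == 0 else 0.0
    if len(pred_boxes) == 0:
        return 0.0
    iou = np.array([[box_iou(p, g) for g in gt_boxes] for p in pred_boxes])
    # Hungarian matching maximizes total IoU; each prediction is assigned to
    # at most one ground truth box and vice versa.
    rows, cols = linear_sum_assignment(-iou)
    matched = [iou[r, c] for r, c in zip(rows, cols) if iou[r, c] > 0]
    n_m = len(matched)             # matched pairs with non-zero IoU
    n_u = len(gt_boxes) - n_m      # ground truth boxes left unassigned
    return float(sum(matched) / (n_m + n_u))

print(localization_score([[0, 0, 10, 10]], [[0, 0, 10, 10], [20, 20, 30, 30]]))  # 0.5
```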

Segmentation identifies which pixels in the input image belong to a given category. Each task specifies one class that should be segmented. The task can be instance level (e.g. “toilet”) or stuff-level (“wall (Stuff)”). A large majority of samples are for segmenting object instances. Instance-level segmentation expects a pixel mask for each instance in the input image and assigns these predictions to the ground truth using a Hungarian matching algorithm. Stuff-level segmentation expects a single mask for the entire image; if multiple predicted instances are provided, we use the union of these masks as the prediction during scoring. We evaluate using Boundary IoU [6], which measures the intersection over union of the dilated boundaries of the prediction and ground truth. Boundary IoU requires more accurate segmentation of large objects without further penalizing predictions of very small objects. The dilation parameter is a fixed fraction of the image diagonal. The final correctness score, analogous to localization, is

$$\mathrm{score}_{seg} = \frac{\sum_{i=1}^{N_m} \mathrm{BIoU}_i}{N_m + N_u} \qquad (2)$$

where BIoU_i is the Boundary IoU of the i-th matched pair, N_m is the number of matched pairs with non-zero Boundary IoU, and N_u is the number of ground truth masks that are not assigned to any prediction.
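The sketch below illustrates the Boundary IoU computation in the spirit of [6]: the boundary region of a mask is obtained by subtracting an eroded copy of the mask, and IoU is computed over the two boundary regions. The `dilation_ratio` value shown is a placeholder, not necessarily the setting used by GRIT.

```python
# Illustrative Boundary IoU (assumes binary uint8 masks); the dilation_ratio
# is a placeholder value, not the official GRIT setting.
import numpy as np
import cv2

def mask_to_boundary(mask, dilation_ratio=0.02):
    """Keep only mask pixels within `dilation` pixels of the mask contour."""
    h, w = mask.shape
    dilation = max(1, int(round(dilation_ratio * np.sqrt(h ** 2 + w ** 2))))
    # Pad so erosion treats the image border as background.
    padded = cv2.copyMakeBorder(mask, 1, 1, 1, 1, cv2.BORDER_CONSTANT, value=0)
    eroded = cv2.erode(padded, np.ones((3, 3), np.uint8), iterations=dilation)
    return mask - eroded[1:h + 1, 1:w + 1]

def boundary_iou(gt_mask, pred_mask, dilation_ratio=0.02):
    gt_b = mask_to_boundary(gt_mask, dilation_ratio)
    pr_b = mask_to_boundary(pred_mask, dilation_ratio)
    inter = np.logical_and(gt_b, pr_b).sum()
    union = np.logical_or(gt_b, pr_b).sum()
    return inter / union if union > 0 else 0.0
```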
task  source  novel source  ablation (images / samples / nouns / grouped nouns)  test (images / samples / nouns / grouped nouns)
categorization COCO [25] 3804 4361 80 80 3780 4337 80 80
Open Images v6 [23] 8542 10014 475 469 8677 10154 479 472
NYU v2 [33] 608 2464 351 338 619 2350 369 361
localization COCO [25] 3642 3885 80 80 3541 3791 80 80
Open Images v6 [23] 13183 14422 602 584 13208 14435 602 584
NYU v2 [33] 632 2771 511 491 631 2854 519 495
vqa VQA v2 [11] 14061 17601 8038 4691 14051 17542 7914 4598
DAQUAR [26] 497 1079 651 500 508 1127 672 499
DCE-VQA [20] 1459 2486 2424 1682 1429 2497 2505 1715
refexp RefCOCO+ [21] 1492 3748 1904 1254 1482 3611 1912 1244
RefCOCOg [27] 2211 3996 2757 1908 2233 4050 2826 1944
RefCLEF [21] 1080 2781 1359 832 1099 2865 1410 843
segmentation COCO [25] 4019 4340 80 80 4024 4345 80 80
NYU v2 [33] 621 2403 417 406 623 2337 429 410
Open Images v6 [23] 5726 6002 328 323 5773 6048 335 330
keypoint COCO [25] 2856 2856 1 1 2914 2914 1 1
Construction [31, 36] 2529 2529 1 1 2471 2471 1 1
normal NYU v2 [33] 331 331 0 0 323 323 0 0
BlendedMVS [37] 338 338 0 0 368 368 0 0
ScanNet [7, 17] 773 773 0 0 740 740 0 0
DTU [19] 344 344 0 0 356 356 0 0
Table 2: Number of images, samples, and concepts for each data source in GRIT.

Referring expressions places a bounding box around the instance corresponding to the provided description and image. There is always exactly one ground truth bounding box, so only one bounding box should be predicted. This task evaluates the ability to interpret relationships and attributes expressed in natural language, as well as localization and categorization. The score is 1 if the IoU between the predicted box and the ground truth box is at least 0.5, and 0 otherwise. If more than one box is predicted, we use the first box in the prediction list to compute correctness and ignore the rest.

Visual question answering responds with a natural language answer to an image and natural language question. Often, as for the VQA dataset [11], multiple ground truth answers are available. If at least three answers match the prediction, the score is 1. Otherwise, the score is the number of matches divided by three. For example, if the annotated answers to “What color is the dog?” are {brown, brown, black, brown, brown}, then “brown” scores 1, “black” scores 1/3, and “white” scores 0. Questions from the original datasets were filtered to select those with at least some threshold level of ground truth annotation consistency. We use the implementation from the VQAv2 [11] benchmark which normalizes the answers through word contractions and removes articles and punctuation before computing the correctness score.
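The following is a minimal sketch of this consensus scoring. The `normalize` function only hints at the VQAv2 normalization (the official implementation also handles contractions and number words), and the helper names are illustrative.

```python
# Sketch of VQA-style consensus scoring; the real VQAv2 evaluation applies a
# richer normalization than the simplified `normalize` shown here.
import re
import string

def normalize(ans):
    ans = ans.lower().strip()
    ans = ans.translate(str.maketrans("", "", string.punctuation))
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)   # drop articles
    return " ".join(ans.split())

def vqa_score(prediction, annotated_answers):
    pred = normalize(prediction)
    matches = sum(normalize(a) == pred for a in annotated_answers)
    return min(matches / 3.0, 1.0)

answers = ["brown", "brown", "black", "brown", "brown"]
print(vqa_score("brown", answers))  # 1.0
print(vqa_score("black", answers))  # 0.333...
print(vqa_score("white", answers))  # 0.0
```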

Person keypoint detection predicts pixel positions for the 17 keypoints of a human body. The correctness score is the average of the object keypoint similarity (OKS) scores computed for each ground truth person instance in the image. As defined for the COCO challenge [25],

$$\mathrm{OKS} = \frac{\sum_i \exp\left(-d_i^2 / (2 s^2 k_i^2)\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

The d_i are the Euclidean distances between each corresponding ground truth and detected keypoint, and the v_i are the visibility flags (0 is not labeled, 1 is labeled but not visible, 2 is labeled and visible) of the ground truth. To compute OKS, we pass the d_i through an unnormalized Gaussian with standard deviation s k_i, where s is the object scale and k_i is a per-keypoint constant that controls falloff. For each keypoint this yields a keypoint similarity that ranges between 0 and 1. These similarities are averaged over all labeled keypoints (v_i > 0). Predictions for keypoints that are not labeled (v_i = 0) do not affect the OKS. Perfect predictions will have OKS = 1, and predictions for which all keypoints are off by more than a few standard deviations will have OKS close to 0. The predicted person instances are assigned to the ground truth person instances in a manner similar to the localization and segmentation tasks; OKS is computed for each combination assigning a prediction to a ground truth, and the Hungarian matching algorithm selects the assignment that maximizes the total correctness score.
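A minimal sketch of OKS for a single ground truth person, following the COCO definition summarized above; the per-keypoint constants `k` and the object scale are inputs here and are not reproduced from the COCO toolkit.

```python
# Sketch of object keypoint similarity (OKS) for one ground truth person.
import numpy as np

def oks(pred_xy, gt_xy, gt_vis, scale, k):
    """pred_xy, gt_xy: (17, 2) arrays; gt_vis: (17,) visibility flags;
    scale: object scale; k: (17,) per-keypoint falloff constants."""
    d2 = np.sum((pred_xy - gt_xy) ** 2, axis=1)                  # squared distances d_i^2
    sim = np.exp(-d2 / (2 * (scale ** 2) * (k ** 2) + 1e-12))    # unnormalized Gaussian
    labeled = gt_vis > 0                                         # only labeled keypoints count
    return float(sim[labeled].mean()) if labeled.any() else 0.0
```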

Surface normal prediction provides 3D surface normal directions for each pixel. Surface normal prediction and normalized depth prediction are popular geometric computer vision tasks. We choose surface normal prediction over depth regression because surface normals describe the shape of an object or surface regardless of the distance to the camera. Additionally, we choose to have participants predict surface normals directly rather than computing them from depth because small errors in depth predictions may lead to large errors in surface normal estimates. However, we compute ground truth surface normals from ground truth depth images, which are accurate enough to produce reasonable normals. Past works approach surface normal estimation in a scene-centric way, e.g. by predicting labels of horizontal and vertical surfaces [16] or major layout components [14], or in a view-sensitive way by directly predicting the normal relative to the camera [10]. Both representations are useful depending on whether the goal is a stable representation for physical reasoning or a camera-sensitive reference frame for grasping. The correctness score accommodates both by adjusting for global orientation before computing the percentage of predicted normals within 11.25 degrees of the ground truth normals. The adjustment is obtained by solving for the rotation matrix that most closely aligns the predicted normals with the ground truth normals. This way, normals can be predicted either in the camera’s viewpoint or in a scene coordinate system.
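The rotation-adjusted scoring can be sketched as below, using the Kabsch/orthogonal Procrustes solution for the best-fit rotation; masking of invalid pixels and the exact alignment procedure of the official scorer may differ.

```python
# Sketch of the rotation-adjusted surface normal score: align predicted
# normals to ground truth with the best-fit rotation, then count the
# fraction of normals within 11.25 degrees.
import numpy as np

def normal_score(pred, gt):
    """pred, gt: (N, 3) arrays of unit normals for valid pixels."""
    # Best rotation R minimizing sum ||R p_i - g_i||^2 (Kabsch algorithm).
    u, _, vt = np.linalg.svd(gt.T @ pred)
    d = np.sign(np.linalg.det(u @ vt))
    R = u @ np.diag([1.0, 1.0, d]) @ vt
    aligned = (R @ pred.T).T
    cos = np.clip(np.sum(aligned * gt, axis=1), -1.0, 1.0)
    angles = np.degrees(np.arccos(cos))
    return float(np.mean(angles < 11.25))
```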

4 Challenge

Data Source Split Annotations
COCO [25] Train & Val Bounding Boxes, Object Labels, Segmentation Masks, Pose Keypoints, Captions
RefCOCO+ [21] Train & Val (UNC) Referring Expressions, Bounding Boxes
VQA v2 [11] Train & Val Questions and Answers
BlendedMVS [37] Train (Our) Surface Normals
ScanNet [7, 17] Train Surface Normals
ImageNet [9] Any Object Labels
Conceptual Captions [32] Any Captions
VisualGenome [22] Any Bounding Boxes, Object, Attribute, Relationship Labels (VQA annotations not allowed)
Web10K [20] Any Queries
Table 3: GRIT Restricted training data. To participate in the Restricted challenge, participants must choose training data only from one or more of the above data sources. For each source, the table specifies the splits and annotations that may be used. For BlendedMVS, we provide our own train split, where train is a subset of BlendedMVS train and ablation/test are subsets of BlendedMVS validation. Other BlendedMVS train data may be used but may be less relevant (e.g. aerial photos). Use of language models and training on any purely non-visual data, such as Book Corpus [39], is also allowed.
Data Source Split Annotations
COCO [25] Test-Reserve Any
Open Images v6 [23] Test Any
NYU v2 [33] Test Any
RefCOCO+/g, RefClef  [27, 21] Test (UNC) Referring Expressions, Bounding Boxes
VQA v2 [11] Test-Reserve Questions and Answers
DAQUAR [26] Test Questions and Answers
VisualGenome [22] Test (BUTD [3]) Questions and Answers
BlendedMVS [37] Test (Our) Surface Normals
ScanNet [7, 17] Test Surface Normals
DTU [19] Test Surface Normals
Construction [31, 36] Test (Our) Pose Keypoints
Table 4: GRIT ablation and test sources. This table lists the splits and annotations that were used to create the GRIT ablation and test sets. Therefore, submissions to either the Restricted or the Unrestricted challenge may not use these splits and annotations for training.

Train Set: The GRIT challenge consists of two tracks depending on the allowed training data. In the Restricted track, participants must use only the data sources and annotations listed in Tab. 3 for model tuning and hyperparameter selection. In the Unrestricted track, any data source may be used except the sources and annotations listed in Tab. 4 that were used to create the GRIT ablation and test sets. Even unsupervised learning is not allowed on the excluded sources.

Ablation and Test Sets: Images and task inputs are provided for the ablation and test sets. Models must not use any direct information about the image source (e.g. by identifying the data source from the image ids and feeding it as an input to the model), and the ablation and test images should not be used in training or parameter selection in any way. Task-level and dataset-level statistics for the GRIT ablation and test sets are shown in Tab. 1 and Tab. 2.

Leaderboards: GRIT has an ablation and a test leaderboard per track. Participants may obtain ablation scores an unlimited number of times but test scores only once in a 7-day period. The test leaderboard should be used only for the final results of the main system. Submissions to both the ablation and test leaderboards are private until made public by the participant. Test scores are hidden unless made public.

5 Concepts

To enable evaluation of core computer vision competencies across a wide range of concepts, GRIT aims to maximize the coverage of concepts while preventing over-representation of any concept. To achieve this, we first tag each sample with the concepts present in the input or output text. Then, we follow a concept-based sampling strategy that, for each concept identified in the source dataset, includes at least one sample containing the concept. We further cap the maximum number of samples per concept unless we encounter a sample where the concept co-occurs with another under-represented concept. A large number of concepts in GRIT are grouped into higher-level concept groups (Tab. 5) that may be of interest to various application domains, such as “food”, “clothing”, and “animals”. GRIT summarizes the performance of vision systems on 24 such application domains for each task (except the keypoint task, which is limited to “people”, and the surface normal task, which does not have any tagged concepts).

Concept Tagging. Each sample in GRIT is tagged with a set of concepts (nouns, adjectives, and verbs) that appear in the task query (an object category, a VQA question, or a referring expression) or the ground truth output text (an object category or VQA answers). To tag concepts in any text, we first tokenize the text and tag each token with a POS tag. Next, we combine any consecutive noun-noun tokens (e.g. dinner table) and consecutive adj-noun tokens (e.g. hot dog) with a high normalized pointwise mutual information (NPMI) into compound nouns (sketched below). We use the unigram and bigram frequencies from BERT’s [8] training corpus (BookCorpus [40] and Wikipedia) to compute NPMI. All nouns, compound nouns, adjectives, and verbs are collected as concept tags. Each tag consists of the original and lemmatized text as well as one of 4 tags - NOUN, CNOUN, ADJ, and VERB.
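A sketch of the compound-noun merging step is shown below. The POS tags are assumed to be precomputed, and the unigram/bigram counts and the NPMI threshold are placeholders rather than the values used for GRIT.

```python
# Illustrative compound-noun merging via normalized PMI (NPMI). The counts,
# the threshold, and the POS tagging are placeholders; GRIT uses counts from
# BERT's training corpus and a fixed high threshold.
import math

def npmi(w1, w2, unigram_counts, bigram_counts, total):
    p1 = unigram_counts[w1] / total
    p2 = unigram_counts[w2] / total
    p12 = bigram_counts[(w1, w2)] / total
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def merge_compounds(tagged_tokens, unigram_counts, bigram_counts, total, thresh=0.5):
    """tagged_tokens: list of (token, pos) pairs with pos in {NOUN, ADJ, ...}."""
    out, i = [], 0
    while i < len(tagged_tokens):
        if i + 1 < len(tagged_tokens):
            (w1, p1), (w2, p2) = tagged_tokens[i], tagged_tokens[i + 1]
            pair_ok = (p1, p2) in {("NOUN", "NOUN"), ("ADJ", "NOUN")}
            known = bigram_counts.get((w1, w2), 0) > 0
            if pair_ok and known and npmi(w1, w2, unigram_counts, bigram_counts, total) > thresh:
                out.append((f"{w1} {w2}", "CNOUN"))   # merge into a compound noun
                i += 2
                continue
        out.append(tagged_tokens[i])
        i += 1
    return out
```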

Concept Grouping. A large number of nouns and compound nouns in GRIT are grouped into at least one of 24 concept groups (Tab. 5). We used Amazon Mechanical Turk (AMT) to map lemmas to concept groups. Specifically, for any lemma that appears more than once in GRIT, we ask 3 workers whether the lemma belongs to each of the concept groups. We then compute an agreement score among the 3 workers for each lemma-group hypothesis as the sum of worker-quality weighted binary assignment scores (sketched below). Worker quality is computed as the fraction of annotations where the worker’s answer matched the majority-vote answer. The assignment is accepted if the agreement score exceeds a chosen threshold. Lemmas that appear only once in GRIT and whose head nouns have already been annotated through AMT borrow the head noun’s assignments. Remaining single-occurrence lemmas are annotated via AMT.
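The agreement computation can be sketched as follows; the acceptance threshold is a placeholder, and the exact weighting used for GRIT may differ.

```python
# Sketch of worker-quality weighted agreement for accepting a lemma-to-group
# assignment; the threshold value is a placeholder.

def worker_quality(worker_answers, majority_answers):
    """Fraction of this worker's answers that match the majority vote.
    Both arguments are dicts keyed by (lemma, group) -> bool."""
    keys = worker_answers.keys() & majority_answers.keys()
    return sum(worker_answers[k] == majority_answers[k] for k in keys) / max(len(keys), 1)

def accept_assignment(votes, qualities, threshold=1.5):
    """votes: list of (worker_id, answered_yes) for one 'lemma belongs to group' question."""
    agreement = sum(qualities[w] for w, answered_yes in votes if answered_yes)
    return agreement >= threshold
```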

concept group #concepts concept lemmas (sampled)
food 1278 artichoke, lamb, powdered sugar, cake, common fig, banana, hot dog, pizza, peach, rice
people 1056 man, person, driver, skateboarder, guy, race, boy, sweater guy, snowboarder, cook
places 1005 bar, side, computer room, building, basement, convenience store, lighthouse, hill, pet store, porch
kitchen_objects 961 sink, spoon, bottle, bowl, platter, wine bottle, fridge, stove, water bottle, cabinet
animals 847 alpaca, food, animal, cat, dog, clydesdale, fox, dragonfly, bee, beak
clothing 756 color clothe, tie, jacket, shoe, shirt, jean, top, maroon shirt, black tshirt, coat
structure 703 room, wall, fence, socket, fountain, road, awning, sign, pool, construction area
vehicles 680 plane, bus, airplane, motorcycle, bike, bicycle, old car, transportation, barge, submarine
household_objects 668 clock, luggage, picture, pillow, scissor, vase, photo, telephone, bottom plant, light bulb
technology 522 turbine, iphone, water purifier, information, mobile phone, microwave, remote, mouse, TV screen, apple
sports_equipment 514 kite, bicycle, football, bicycle wheel, new jersey, frisbee, volleyball, cricket ball, tennis racket, oar
body_parts 493 boys foot, beard, nose, jersey head, hand, chest, head, ear, tusk, eye
clothing_accessories 481 backpack, bandana, necktie, sneaker, purse, hat, handbag, glove, high heel, mask
furniture 468 metal chair, chair, bed, tile counter, table, wardrobe, couch, counter, stool, table square
natural_landscape 439 sand, grass, water, apple, land, wood, river, cairngorm, tree, maple
transport_infrastructure 366 left, fire hydrant, parking meter, stop sign, road, mosco street, traffic light, dock, bus route, ski lift
bathroom_objects 359 toilet, toothbrush, floor mat, lipstick, restroom, cabinet, plastic object, towel rod, shower, sponge
brands 332 brand name, ford, ipad, ipod, dell, sign advertising, ducati, camel, adida, tarmak
tools 295 rope, white utensil, controller, rifle, scale, object, box, drill, personal flotation, cart
plants 254 grass, hay, tree bush, red flower, flower, ornamental plant, straw, floret, stick, tree
stationery 251 folder, letter, office supply, object, paper, ruler, tag, whiteboard, paper holder, book
beverages 206 juice, water, milk, coffee, liquid, drink, healthy meal, beer, cocktail, wine
birds 140 bat, bird, hummingbird, raven, goose, sparrow, egg, swan, chicken, penguin
musical_instruments 75 wind, keyboard, guitar, organ, piano, flute, violin, scale, saxophone, horn
Table 5: Concept groups with number of unique concepts and 10 random concepts from each group sampled in proportion to the frequency of occurrence in GRIT. While generally accurate, a few errors in identifying concepts and concept lemma to group mapping stem from inaccurate POS tagging, incorrect assignment by AMT workers (e.g. “left” is mapped to “transport_infrastructure”), or the automatic mapping of single-occurrence lemmas using head nouns (e.g. “jersey head”, a type of head wrap, is mapped to “body_parts” instead of “clothing”).

Sampling. Instead of sampling uniformly at random from each data source, we follow a per-concept sampling strategy that maximizes the number of concepts represented in GRIT while also preventing any concept from dominating the evaluation. Specifically, we iterate over the noun concepts in increasing order of frequency in the original dataset and select a fixed number of samples, say N, for each concept without replacement. Due to co-occurrence of concepts, some samples containing a concept may already have been selected while sampling other, less frequent concepts. Therefore, we only select as many additional samples from the remaining pool as are needed to bring the concept’s total up to N (a sketch follows below). The number of remaining samples and the number of selected samples for each concept are updated at every iteration after selecting the concept’s samples. Fig. 3 contrasts per-concept sampling with random sampling when selecting GRIT samples from VQAv2. The overall concept frequency distribution in the GRIT ablation set resulting from the per-concept sampling strategy is shown in Tab. 6.
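A sketch of this per-concept sampling loop, assuming a mapping from concepts to the samples that contain them; the cap N and tie-breaking details are illustrative.

```python
# Sketch of the per-concept sampling strategy: iterate over noun concepts from
# rare to frequent and top up each concept to at most N samples, counting
# samples already selected through co-occurring concepts.
import random

def per_concept_sample(concept_to_samples, n_per_concept, seed=0):
    """concept_to_samples: dict mapping concept -> list of sample ids."""
    rng = random.Random(seed)
    selected = set()
    # Rarest concepts first so they are guaranteed representation.
    for concept in sorted(concept_to_samples, key=lambda c: len(concept_to_samples[c])):
        pool = concept_to_samples[concept]
        already = sum(s in selected for s in pool)          # selected via co-occurrence
        remaining = [s for s in pool if s not in selected]
        need = min(max(n_per_concept - already, 0), len(remaining))
        selected.update(rng.sample(remaining, need))
    return selected
```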

task 0-5 6-15 16-25 26-50 51-100 101-500 >500
categorization 204 84 213 170 69 5 0
localization 316 39 304 252 74 4 0
vqa 7884 729 228 261 114 71 5
refexp 4402 260 84 75 57 20 1
segmentation 283 91 172 80 69 0 0
keypoint 0 0 0 0 0 0 1
normal 0 0 0 0 0 0 0
Table 6: Concept frequency distribution. Each column labeled a-b indicates the number of NOUN or CNOUN concepts that appear at least a and at most b times for each task in the GRIT ablation set.
Figure 3: Effect of per-concept sampling on the VQAv2 dataset. In the plot, each point corresponds to a concept. Per-concept sampling generally selects more samples for concepts that would have been represented fewer than 50 times under random sampling, while reducing the representation of concepts that appear more than 50 times under random sampling.

6 Metrics

Metrics for GRIT comprise accuracy, knowledge, and calibration measures computed on various subsets of data. Specifically, we follow a 4-part nomenclature for metrics - measure.cgroup.partition.task (a sketch of resolving such a metric name into a score follows the list below).

  • measure: one of the performance measures listed in Tab. 7 and described in Sec. 6.2.

  • cgroup: specifies the concept group to measure performance over. GRIT consists of 24 concept groups (e.g. “plants”, “animals”, “tools”, “furniture”) and cgroup = “any” implies no concept group restriction.

  • partition: restricts metric computation to samples with a specific property (e.g. samples containing distorted images). GRIT consists of 7 partitions described in Tab. 8.

  • task: one of the 7 tasks or “all” for overall performance across all tasks.
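To illustrate the nomenclature, the sketch below resolves a metric name such as acc.any.newSrc.vqa into an average over per-sample records. The record fields and the subset of measures handled are illustrative, and overall (“all”) metrics additionally aggregate per task as described in Sec. 6.3.

```python
# Sketch of resolving a measure.cgroup.partition.task name into a score from
# per-sample records; field names here are illustrative, not the official API.
def compute_metric(records, measure, cgroup, partition, task):
    """records: list of dicts with keys 'task', 'cgroups', 'partitions',
    'correctness', 'confidence' (correctness and confidence in [0, 1])."""
    def value(r):
        if measure == "acc":
            return r["correctness"]
        if measure == "conf":
            return r["confidence"]
        if measure == "inf":
            return r["confidence"] * r["correctness"]
        raise ValueError(f"unhandled measure: {measure}")

    subset = [r for r in records
              if r["task"] == task
              and (cgroup == "any" or cgroup in r["cgroups"])
              and partition in r["partitions"]]
    return sum(value(r) for r in subset) / max(len(subset), 1)
```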

6.1 Robustness

GRIT allows evaluation of robustness under distribution shifts due to the following factors:

  • Change in image source: performance may be compared across samples with the same or different image source than training data by setting:
    partition ∈ {sameSrc, newSrc}

  • Change in concept distribution: performance may be compared across samples that only contain concepts seen during training or ones containing at least one novel concept by setting:
    partition ∈ {sameCpt, newCpt}

  • Image perturbation: a range of image perturbations of various types and intensities have been applied to a subset of 385 samples from each task. Specifically, we apply each of the 19 types of distortion from [15] at five severity levels to 3 samples per task, and apply universal adversarial perturbations from [29] at five severity levels to 20 samples per task. Performance may be compared on this subset for samples with and without distortions by setting:
    partition ∈ {dist, undist}

Measure         Abbrev.  Purpose                  Function
Accuracy*       acc      Correctness              (1/N) Σ_i a_i
Information     inf      Confident correctness    (1/N) Σ_i c_i a_i
Misinformation  misinf   Confident incorrectness  (1/N) Σ_i c_i (1 - a_i)
Confidence      conf     Actionability            (1/N) Σ_i c_i
Self-Awareness  sa       Calibration              (1/N) Σ_i [c_i a_i + (1 - c_i)(1 - a_i)]
RMSE*           rmse     Calibration              sqrt((1/N) Σ_i (c_i - a_i)^2)
Table 7: Knowledge and calibration measures computed over a set of N samples indexed by i, with a_i and c_i as the accuracy (correctness score) and confidence of the prediction for the i-th sample. Confidence and correctness scores are assumed to be in the range 0 to 1. Tab. 8 defines the sets over which the measures are computed. * highlights the main measures to be used for model comparison on GRIT.
Partition Abbrev. Description
Same source sameSrc Samples that share the same image source as primary training data
New source* newSrc Samples that use an image source different from primary training data
Aggregate* agg Average of sameSrc and newSrc performance
Same concept sameCpt Samples that only contain concepts seen in primary training data
New concept newCpt Samples containing at least one concept unseen in primary training data
Distorted dist Samples containing distorted or perturbed images
Undistorted undist Samples containing corresponding undistorted images
Delta distorted deldist Distorted and undistorted sample pairs to compute change in measures due to distortion
Table 8: Partitions define various subsets of samples over which measures shown in Tab. 7 are aggregated. * indicates the main partitions to use for model comparison for GRIT.

6.2 Knowledge and Calibration

Typically, vision and vision-language benchmarks encourage the research community to build more accurate models. However, models are known to be confidently wrong [24], with potentially undesirable consequences. Further, to evaluate knowledge, it is important to consider the model’s certainty in its beliefs in addition to whether those beliefs are correct [18]. To encourage models that are both accurate and well calibrated, GRIT includes the following measures of model knowledge and calibration in addition to accuracy (a computational sketch follows the list):

  • Confidence: AI-driven human decision making often hinges on the model’s reported confidence in its own predictions. A low confidence prediction may be ignored by the decision maker as unreliable while a high confidence prediction is likely to be trusted and used in the decision making process. We report average confidence as a measure of a model’s usable or actionable beliefs. These are beliefs that may be used in the decision making process irrespective of whether those beliefs are correct.

  • Information: confidence weighted correctness score that indicates how often the model is both confident and correct and hence well-informed.

  • Misinformation: confidence weighted complement of the correctness score that indicates how often the model is confidently wrong or misinformed. This reflects the portion of the model’s beliefs with potentially negative and harmful consequences with regards to fairness and operational risk (e.g. in autonomous driving applications).

  • Self-Awareness: measure of whether a model knows what it knows and doesn’t know.

  • RMSE: root mean square error between confidence and correctness score that indicates whether the predicted confidence is a reliable proxy for accuracy.
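The measures above can be sketched as simple averages over per-sample accuracy a_i and confidence c_i, consistent with the reconstructed Table 7; the official definitions may differ in detail.

```python
# Sketch of the knowledge and calibration measures over per-sample accuracy
# a_i and confidence c_i (both in [0, 1]), matching the descriptions above.
import numpy as np

def knowledge_measures(a, c):
    a, c = np.asarray(a, float), np.asarray(c, float)
    return {
        "acc":    a.mean(),
        "conf":   c.mean(),
        "inf":    (c * a).mean(),                      # confident and correct
        "misinf": (c * (1 - a)).mean(),                # confident but wrong
        "sa":     (c * a + (1 - c) * (1 - a)).mean(),  # self-awareness
        "rmse":   float(np.sqrt(((c - a) ** 2).mean())),
    }

print(knowledge_measures([1.0, 0.0, 0.5], [0.9, 0.2, 0.5]))
```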

6.3 Recommended metrics

While GRIT provides a number of informative metrics for model performance analysis, for consistency, we recommend the following metrics for comparing models on individual tasks on GRIT:

  • {acc, rmse}.any.{agg, newSrc}.{task}: accuracy and RMSE computed on the “Aggregate” and “New source” partitions for each task

  • acc.any.dist.{task}: accuracy on the distorted samples for each task

For comparing models on overall performance across all tasks, we recommend:

  • acc.any.agg.all: average accuracy over all combinations of {task} × {sameSrc, newSrc} as the primary metric

  • acc.any.dist.all: average accuracy on distorted samples across tasks

Note that these overall measures assume 0 accuracy on tasks not supported by the model.

7 Baselines

cat loc vqa ref seg kp sn
Model restricted agg newSrc agg newSrc agg newSrc agg newSrc agg newSrc agg newSrc agg newSrc
Bae et al. [4] - - - - - - - - - - - - 49.4 43.9
Mask-RCNN [13] - - 44.6 43.3 - - - - 26.2 8.2 70.8 72.1 - -
GPV-1 [12] 33.2 9.4 42.8 39.5 50.6 38.0 25.8 18.4 - - - - - -
GPV-2 [20] 54.8 22.9 53.5 54.7 63.5 53.3 51.5 39.4 - - - - - -
Table 9: Aggregate performance and generalization to new sources on GRIT ablation set. Same / new source partitions are true to their name only in the Restricted setting. agg: acc.any.agg.task; newSrc: acc.any.newSrc.task
cat loc vqa ref seg kp sn
Model restricted same new same new same new same new same new same new same new
Bae et al. [4] - - - - - - - - - - - - 49.6 -
Mask-RCNN [13] - - 51.9 40.8 - - - - 44.9 0.3 70.9 - - -
GPV-1 [12] 58.7 0.8 48.3 37.8 58.3 74.0 29.7 23.1 - - - - - -
GPV-2 [20] 84.9 13.5 54.6 54.2 69.8 81.7 57.8 48.3 - - - - - -
Table 10: Generalization to new concepts on GRIT ablation set. Same / new concept partitions are true to their name only in the Restricted setting. same: acc.any.sameCpt.task; new: acc.any.newCpt.task
cat loc vqa ref seg kp sn
Model restricted undist dist undist dist undist dist undist dist undist dist undist dist undist dist
Bae et al. [4] - - - - - - - - - - - - 54.3 42.6
Mask-RCNN [13] - - 47.4 20.6 - - - - 40.6 18.8 67.9 43.4 - -
GPV-1 [12] 58.4 29.1 45.6 32.2 65.2 57.1 30.1 33.0 - - - - - -
GPV-2 [20] 86.5 64.7 51.7 34.9 72.4 61.8 60.5 56.4 - - - - - -
Table 11: Robustness to image distortions on GRIT ablation set. For each task, performance is computed on the same set of images and model inputs with and without distortion. undist: acc.any.undist.task; dist: acc.any.dist.task
cat loc vqa ref seg kp sn
Model restricted rmse sa rmse sa rmse sa rmse sa rmse sa rmse sa rmse sa
Bae et al. [4] - - - - - - - - - - - - 55.5 49.4
Mask-RCNN [13] - - 49.1 66.8 - - - - 21.1 85.5 32.8 72.8 - -
GPV-1 [12] 58.0 49.5 50.1 51.5 49.0 60.8 65.7 44.1 - - - - - -
GPV-2 [20] 55.6 62.5 42.4 55.1 37.7 71.0 48.6 59.4 - - - - - -
Table 12: Calibration on GRIT ablation set. rmse: rmse.any.agg.task (lower is better); sa: sa.any.agg.task

We present a preliminary set of experiments to demonstrate various evaluation capabilities afforded by GRIT and to highlight avenues for future research. Since there is no single model that can perform all the GRIT tasks, we evaluate the following models to cover all tasks.

  1. GPV-1 [12] is a task-agnostic vision-language architecture that can perform any task with text and image as inputs and text, bounding boxes, and region relevance scores as outputs. GPV-1 is trained on COCO for the categorization, localization, VQA, and captioning tasks. We evaluate GPV-1 on the GRIT categorization, localization, and VQA tasks, for which the model was trained using COCO (along with COCO captioning). We also evaluate GPV-1 on the referring expression task in a zero-shot setting. Confidence scores for categorization and VQA are the likelihood scores of the output text. For localization, only boxes with relevance greater than a threshold are selected as predictions, with the average relevance being the prediction confidence. For referring expression, the most relevant box and its relevance score are used as the prediction and confidence, respectively.

  2. GPV-2 [20] is a GPV architecture based on the T5 [30] encoder-decoder architecture trained on multiple NLP tasks, and it uses region proposals and visual representations computed by the powerful VinVL [38] object detector. GPV-2 is trained to perform the same set of tasks as GPV-1 along with a classification-in-context task and the referring expressions task, but for more than 10,000 concepts instead of just the 80 primary COCO categories. To do so, GPV-2 learns these concepts from web image search results for noun, adj-noun, and noun-verb queries and transfers the learned concepts across skills learned from task-specific COCO annotations. Confidence scores for GPV-2 are generated in a manner similar to GPV-1, but GPV-2 reuses the language decoder to score boxes.

  3. Mask-RCNN [13] is a well-tuned detection, segmentation, and keypoint prediction architecture that uses an anchor-box based region proposal network, and box, class, and segmentation heads to produce region-level outputs. Since Mask-RCNN is limited to COCO categories, predicted instances in segmentation and localization are selected only if their label matches the query class. The confidence score is calculated as the average of the selected instances’ confidence scores.

  4. Bae et al. [4] uses the estimated aleatoric uncertainty in surface normal prediction to guide the stochastic selection of pixels to use for training. Input images are resized to the model’s input resolution, and the predicted normal map is resized back to the original input size and multiplied by -1 to match the ground truth coordinate frame. The confidence score is calculated as the percentage of pixels in the uncertainty map with expected angular error below a fixed threshold. We use the publicly available model trained on ScanNet.

Except for GPV-2, all of the above models are eligible for submission to the Restricted track. GPV-2, however, can only be submitted to the Unrestricted track since it uses a VinVL backbone trained on Open Images v6 and Objects365 (along with COCO and Visual Genome), which are not part of the Restricted training set. We now discuss the generality, robustness, and calibration of these models on the GRIT benchmark.

Generality. Tab. 9 shows accuracy of each model across supported tasks. Generally, performance drops on new sources across tasks for the Restricted models. On the keypoints task, model performance is higher on the new source since Construction dataset images are slightly biased towards simpler images focused on a single, clearly visible person. GPV-2 shows a similar drop in new source accuracy with the exception of localization. This is because in the Unrestricted track, models are allowed to train on the Open Images train set which is one of the novel sources. Note that “same” and “novel” are defined with respect to the Restricted training data and may not be applicable to the evaluation of generalization to novel sources in the Unrestricted setting.

Tab. 10 shows that Restricted models struggle to generalize to novel concepts, especially for categorization and segmentation. Both of these tasks require predicting a category label never seen in the task’s training data. On VQA, the performance on new concepts is significantly higher. However, note that the VQA new concept evaluation is somewhat limited since the VQAv2 training data covers a surprisingly large number of concepts, resulting in only a small number of samples with novel concepts, many of which are of the relatively high scoring Yes/No question type (see the performance breakdown by question type in Tabs. 2 and 3 of [11]). Finally, the VQA novel concepts are often bigrams (e.g. “winter glove”, “restaurant pizza”) where one or both of the words is either superfluous or seen during training. The keypoint task is limited to a single concept (“person”) and surface normal samples are not tagged with concepts, so these tasks do not report performance on the new concept partition.

Robustness. Tab. 11 shows that all models suffer a drop in performance across all tasks due to image distortion, with the exception of zero-shot GPV-1 on the referring expressions task. Note that in this case, the performance on undistorted images is already quite low, as expected from a zero-shot model.

Calibration. Tab. 12 shows RMSE and self-awareness for each model on its supported tasks. RMSE measures how reliable the predicted confidence score is as an estimate of model correctness. Self-awareness also encourages greater confidence scores for correct predictions and lower confidence for incorrect ones. A model may maximize self-awareness by predicting a confidence of 1 when the correctness score is greater than 0.5 and 0 otherwise. However, such binary confidences typically worsen the model’s RMSE. Therefore, the choice of calibration metric to optimize depends on how the confidence scores will be used in downstream decision making.

Model restricted params (M) acc.any.agg.all acc.any.dist.all
Mask-RCNN [13] 58 20.2 11.8
Bae et al. [4] 72 7.1 6.1
GPV-1 [12] 236 21.8 21.6
GPV-2 [20] 370 31.9 31.1
Table 13: Model size and overall performance averaged across all tasks on the GRIT ablation set. Accuracy is assumed to be 0 for tasks not supported by the model.

Overall performance. Model parameter counts and overall performance for all baselines are shown in Tab. 13.

8 Related Generalization Benchmarks

The Robust Vision Challenge [1] invites researchers to submit their model’s results to one or more of seven tasks (stereo, optical flow, single-view depth estimation, object detection, and semantic/instance/panoptic segmentation) across 13 benchmark datasets. For any submitted task, the same model must be used to generate results on all applicable benchmarks. This tests a system’s ability to learn and perform on multiple data distributions simultaneously, while GRIT, in the Restricted track, further requires the ability to generalize to new data and label distributions by prohibiting training on held-out benchmarks. GRIT also includes language tasks (VQA, referring expressions) and excludes video and multiview tasks.

ObjectNet [5] is a test-only dataset that evaluates generalization of object classification to a new data source and broad pose variation. GRIT offers a broader set of tasks and stipulates a standard set of training data to enable comparison of in-distribution and new-distribution performance while (optionally) controlling for training data.

GLUE [35] (General Language Understanding Evaluation) provides a broad benchmark to test state-of-the-art approaches on natural language processing tasks. SuperGLUE [34] is an improved revision of GLUE that includes a set of more challenging language tasks. We adopt some design principles from GLUE, such as restricting to tasks with unambiguous ground truth and using existing datasets where possible. For pragmatic reasons, we do flex on these ideals (for example, questions in VQA can often legitimately be answered in multiple ways), and we create new benchmarks for person keypoint prediction and surface normal estimation to increase the diversity of data sources.

9 Conclusion

In empirical computer vision and language understanding research, high quality benchmarks such as ImageNet, MSCOCO, and GLUE have driven progress over multiple years or even an entire decade. As a result, we now have highly performant, data-driven, task-specific models. The time is ripe to unify these advances into more general purpose systems that are flexible enough to perform a wide range of tasks without requiring architecture changes and robust enough to withstand the drafts of distribution shift that plague vision and vision-language models in the open world setting. GRIT hopes to drive the development of such models by providing a unified platform for evaluation of generality and robustness of 7 core capabilities of computer vision across multiple data sources and diverse concepts.

10 Acknowledgements

We thank Tsung-Yi Lin and the COCO team, Aishwarya Agrawal and the VQA team, Angela Dai and the Scannet team, and the Construction Keypoints team for valuable data contributions to GRIT. We are also thankful to Amita Kamath, Christopher Clark, Zhen Zhu, Michal Shlapentokh-Rothman, and Jiasen Lu for several discussions that helped shape GRIT. Many thanks to Yuqun Wu for surface normal processing and experiments. Finally, we are grateful to Michal Guerquin, Jon Borchardt, M Kusold, and Michael Schmitz from the AI2 Reviz team for providing incredible web-tools and engineering support for GRIT.

References

  • [1] Robust vision challenge. http://robustvision.net/. Accessed: 2022-04-19.
  • [2] Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. Vqa: Visual question answering. International Journal of Computer Vision, 123:4–31, 2015.
  • [3] Peter Anderson, X. He, C. Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang.

    Bottom-up and top-down attention for image captioning and visual question answering.

    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

    , pages 6077–6086, 2018.
  • [4] Gwangbin Bae, Ignas Budvytis, and Roberto Cipolla. Estimating and exploiting the aleatoric uncertainty in surface normal estimation. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 13117–13126, 2021.
  • [5] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In NeurIPS, 2019.
  • [6] Bowen Cheng, Ross B. Girshick, Piotr Dollár, Alexander C. Berg, and Alexander Kirillov. Boundary iou: Improving object-centric image segmentation evaluation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. Computer Vision Foundation / IEEE, 2021.
  • [7] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
  • [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina N. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  • [9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [10] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2650–2658. IEEE Computer Society, 2015.
  • [11] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017.
  • [12] Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem. Towards general purpose vision systems: An end-to-end task-agnostic vision-language architecture. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [13] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
  • [14] Varsha Hedau, Derek Hoiem, and David A. Forsyth. Recovering the spatial layout of cluttered rooms. In IEEE ICCV, 2009.
  • [15] Dan Hendrycks and Thomas Dietterich.

    Benchmarking neural network robustness to common corruptions and perturbations.

    Proceedings of the International Conference on Learning Representations, 2019.
  • [16] Derek Hoiem, Alexei A. Efros, and Martial Hebert. Recovering surface layout from an image. International Journal of Computer Vision, 75(1), Oct 2007.
  • [17] Jingwei Huang, Yichao Zhou, Thomas Funkhouser, and Leonidas Guibas. Framenet: Learning local canonical frames of 3d surfaces from a single rgb image. arXiv preprint arXiv:1903.12305, 2019.
  • [18] Darwin P. Hunt. The concept of knowledge and how to measure it. Journal of Intellectual Capital, 4:100–113, 2003.
  • [19] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engil Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413. IEEE, 2014.
  • [20] Amita Kamath, Christopher Clark, Tanmay Gupta, Eric Kolve, Derek Hoiem, and Aniruddha Kembhavi. Webly supervised concept expansion for general purpose vision models. ArXiv, abs/2202.02317, 2022.
  • [21] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014.
  • [22] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. 2016.
  • [23] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982, 2018.
  • [24] Zhizhong Li and Derek Hoiem. Improving confidence estimates for unfamiliar examples. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2683–2692, 2020.
  • [25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [26] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1682–1690. Curran Associates, Inc., 2014.
  • [27] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
  • [28] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 86–94. IEEE Computer Society, 2017.
  • [29] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 86–94. IEEE Computer Society, 2017.
  • [30] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, W. Li, and Peter J. Liu.

    Exploring the limits of transfer learning with a unified text-to-text transformer.

    J. Mach. Learn. Res., 21:140:1–140:67, 2020.
  • [31] D. Roberts, M. Wang, W. Torres Calderon, and M. Golparvar-Fard.

    An annotation tool for benchmarking methods for automated construction worker pose estimation and activity analysis.

    In International Conference on Smart Infrastructure and Construction 2019, ICSIC 2019, International Conference on Smart Infrastructure and Construction 2019, ICSIC 2019: Driving Data-Informed Decision-Making, pages 307–313. ICE Publishing, 2019.
  • [32] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
  • [33] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European Conference on Computer Vision, pages 746–760, 2012.
  • [34] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint 1905.00537, 2019.
  • [35] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics.
  • [36] Jun Yang, Zhongke Shi, and Ziyan Wu. Vision-based action recognition of construction workers using dense trajectories. Adv. Eng. Informatics, 30(3):327–336, 2016.
  • [37] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. Computer Vision and Pattern Recognition (CVPR), 2020.
  • [38] Pengchuan Zhang, Xiujun Li, X. Hu, Jianwei Yang, L. Zhang, Li-Juan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Making visual representations matter in vision-language models. ArXiv, abs/2101.00529, 2021.
  • [39] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [40] Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In ICCV, 2015.