LVIS: A Dataset for Large Vocabulary Instance Segmentation

08/08/2019, by Agrim Gupta, et al.

Progress on object detection is enabled by datasets that focus the research community's attention on open challenges. This process led us from simple images to complex scenes and from bounding boxes to segmentation masks. In this work, we introduce LVIS (pronounced `el-vis'): a new dataset for Large Vocabulary Instance Segmentation. We plan to collect 2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164k images. Due to the Zipfian distribution of categories in natural images, LVIS naturally has a long tail of categories with few training samples. Given that state-of-the-art deep learning methods for object detection perform poorly in the low-sample regime, we believe that our dataset poses an important and exciting new scientific challenge. LVIS is available at http://www.lvisdataset.org.


1 Introduction

A central goal of computer vision is to endow algorithms with the ability to intelligently describe images. Object detection is a canonical image description task; it is intuitively appealing, useful in applications, and straightforward to benchmark in existing settings. The accuracy of object detectors has improved dramatically and new capabilities, such as predicting segmentation masks and 3D representations, have been developed. There are now exciting opportunities to push these methods towards new goals.

Today, rigorous evaluation of general purpose object detectors is mostly performed in the few category regime (e.g. 80) or when there are a large number of training examples per category (e.g. 100 to 1000+). Thus, there is an opportunity to enable research in the natural setting where there are a large number of categories and per-category data is sometimes scarce. The long tail of rare categories is inescapable; annotating more images simply uncovers previously unseen, rare categories (see Fig. 9 and [29, 25, 24, 27]). Efficiently learning from few examples is a significant open problem in machine learning and computer vision, making this opportunity one of the most exciting from a scientific and practical perspective. But to open this area to empirical study, a suitable, high-quality dataset and benchmark is required.

Figure 1: Example annotations. We present LVIS, a new dataset for benchmarking Large Vocabulary Instance Segmentation in the 1000+ category regime with a challenging long tail of rare objects.

We aim to enable this new research direction by designing and collecting LVIS (pronounced ‘el-vis’)—a benchmark dataset for research on Large Vocabulary Instance Segmentation. We are collecting instance segmentation masks for more than 1000 entry-level object categories (see Fig. 1). When completed, we plan for our dataset to contain 164k images and 2 million high-quality instance masks. (We plan to annotate the 164k images in COCO 2017, for which we have permission to label test2017; 2M is a projection after labeling 85k images.) Our annotation pipeline starts from a set of images that were collected without prior knowledge of the categories that will be labeled in them. We engage annotators in an iterative object spotting process that uncovers the long tail of categories that naturally appears in the images and avoids using machine learning algorithms to automate data labeling.

We designed a crowdsourced annotation pipeline that enables the collection of our large-scale dataset while also yielding high-quality segmentation masks. Quality is important for future research because relatively coarse masks, such as those in the COCO dataset [18], limit the ability to differentiate algorithm-predicted mask quality beyond a certain, coarse point. When compared to expert annotators, our segmentation masks have higher overlap and boundary consistency than both COCO and ADE20K [28].

To build our dataset, we adopt an evaluation-first design principle. This principle states that we should first determine exactly how to perform quantitative evaluation and only then design and build a dataset collection pipeline to gather the data entailed by the evaluation. We select our benchmark task to be COCO-style instance segmentation and we use the same COCO-style average precision (AP) metric that averages over categories and different mask intersection over union (IoU) thresholds [19]. Task and metric continuity with COCO reduces barriers to entry.

Buried within this seemingly innocuous task choice are immediate technical challenges: How do we fairly evaluate detectors when one object can reasonably be labeled with multiple categories (see Fig. 2)? How do we make the annotation workload feasible when labeling 164k images with segmented objects from over 1000 categories?

The essential design choice resolving these challenges is to build a federated dataset: a single dataset that is formed by the union of a large number of smaller constituent datasets, each of which looks exactly like a traditional object detection dataset for a single category. Each small dataset provides the essential guarantee of exhaustive annotations for a single category—all instances of that category are annotated. Multiple constituent datasets may overlap and thus a single object within an image can be labeled with multiple categories. Furthermore, since the exhaustive annotation guarantee only holds within each small dataset, we do not require the entire federated dataset to be exhaustively annotated with all categories, which dramatically reduces the annotation workload. Crucially, at test time the membership of each image with respect to the constituent datasets is not known by the algorithm and thus it must make predictions as if all categories will be evaluated. The evaluation oracle evaluates each category fairly on its constituent dataset.

In the remainder of this paper, we summarize how our dataset and benchmark relate to prior work, provide details on the evaluation protocol, describe how we collected data, and then discuss results of the analysis of this data.

Dataset Timeline.

We report detailed analysis on the 5000 image val subset that we have annotated twice. We have now annotated an additional 77k images (split between train, val, and test), representing 50% of the final dataset; we refer to this as LVIS v0.5 (see §A for details). The first LVIS Challenge, based on v0.5, will be held at the COCO Workshop at ICCV 2019.

Figure 2: Category relationships from left to right: non-disjoint category pairs may be in partially overlapping, parent-child, or equivalent (synonym) relationships, implying that a single object may have multiple valid labels. The fair evaluation of an object detector must take the issue of multiple valid labels into account.
Figure 3: Example LVIS annotations (one category per image for clarity). See http://www.lvisdataset.org/explore.

1.1 Related Datasets

Datasets shape the technical problems researchers study and consequently the path of scientific discovery [17]. We owe much of our current success in image recognition to pioneering datasets such as MNIST [16], BSDS [20], Caltech 101 [6], PASCAL VOC [5], ImageNet [23], and COCO [18]. These datasets enabled the development of algorithms that detect edges, perform large-scale image classification, and localize objects by bounding boxes and segmentation masks. They were also used in the discovery of important ideas, such as Convolutional Networks [15, 13], Residual Networks [10], and Batch Normalization [11].

LVIS is inspired by these and other related datasets, including those focused on street scenes (Cityscapes [3] and Mapillary [22]) and pedestrians (Caltech Pedestrians [4]). We review the most closely related datasets below.

COCO [18] is the most popular instance segmentation benchmark for common objects. It contains 80 categories that are pairwise distinct. There are a total of 118k training images, 5k validation images, and 41k test images. All 80 categories are exhaustively annotated in all images (ignoring annotation errors), leading to approximately 1.2 million instance segmentation masks. To establish continuity with COCO, we adopt the same instance segmentation task and AP metric, and we are also annotating all images from the COCO 2017 dataset. All 80 COCO categories can be mapped into our dataset. In addition to representing an order of magnitude more categories than COCO, our annotation pipeline leads to higher-quality segmentation masks that more closely follow object boundaries (see §4).

ADE20K [28] is an ambitious effort to annotate almost every pixel in 25k images with object instance, ‘stuff’, and part segmentations. The dataset includes approximately 3000 named objects, stuff regions, and parts. Notably, ADE20K was annotated by a single expert annotator, which increases consistency but also limits dataset size. Due to the relatively small number of annotated images, most of the categories do not have enough data to allow for both training and evaluation. Consequently, the instance segmentation benchmark associated with ADE20K evaluates algorithms on the 100 most frequent categories. In contrast, our goal is to enable benchmarking of large vocabulary instance segmentation methods.

iNaturalist [26] contains nearly 900k images annotated with bounding boxes for 5000 plant and animal species. Similar to our goals, iNaturalist emphasizes the importance of benchmarking classification and detection in the few example regime. Unlike our effort, iNaturalist does not include segmentation masks and is focused on a different image and fine-grained category distribution; our category distribution emphasizes entry-level categories.

Open Images v4 [14] is a large dataset of 1.9M images. The detection portion of the dataset includes 15M bounding boxes labeled with 600 object categories. The associated benchmark evaluates the 500 most frequent categories, all of which have over 100 training samples (roughly 70% of them have over 1000 training samples). Thus, unlike our benchmark, low-shot learning is not integral to Open Images. Also different from our dataset is the use of machine learning algorithms to select which images will be annotated by using classifiers for the target categories. Our data collection process, in contrast, involves no machine learning algorithms and instead discovers the objects that appear within a given set of images. Starting with release v4, Open Images has used a federated dataset design for object detection.

2 Dataset Design

We followed an evaluation-first design principle: prior to any data collection, we precisely defined what task would be performed and how it would be evaluated. This principle is important because there are technical challenges that arise when evaluating detectors on a large vocabulary dataset that do not occur when there are few categories. These must be resolved first, because they have profound implications for the structure of the dataset, as we discuss next.

2.1 Task and Evaluation Overview

Task and Metric.

Our dataset benchmark is the instance segmentation task: given a fixed, known set of categories, design an algorithm that when presented with a previously unseen image will output a segmentation mask for each instance of each category that appears in the image along with the category label and a confidence score. Given the output of an algorithm over a set of images, we compute mask average precision (AP) using the definition and implementation from the COCO dataset [19] (for more detail see §2.3).
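Because the metric is the standard COCO mask AP, evaluation can be scripted with the reference pycocotools implementation. The sketch below is illustrative only (the file paths are placeholders) and is not the LVIS evaluation server itself.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: ground truth and detector results, both in COCO json format.
coco_gt = COCO("annotations/instances_val.json")
coco_dt = coco_gt.loadRes("detections/segm_results.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="segm")  # mask (not box) evaluation
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # AP averaged over IoU thresholds 0.5:0.95 and over categories
```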

Evaluation Challenges.

Datasets like PASCAL VOC and COCO use manually selected categories that are pairwise disjoint: when annotating a car, there’s never any question if the object is instead a potted plant or a sofa. When increasing the number of categories, it is inevitable that other types of pairwise relationships will occur: (1) partially overlapping visual concepts; (2) parent-child relationships; and (3) perfect synonyms. See Fig. 2 for examples.

If these relations are not properly addressed, then the evaluation protocol will be unfair. For example, most toys are not deer and most deer are not toys, but a toy deer is both—if a detector outputs deer and the object is only labeled toy, the detection will be marked as wrong. Likewise, if a car is only labeled vehicle, and the algorithm outputs car, it will be incorrectly judged to be wrong. Or, if an object is only labeled backpack and the algorithm outputs the synonym rucksack, it will be incorrectly penalized. Providing a fair benchmark is important for accurately reflecting algorithm performance.

These problems occur when the ground-truth annotations are missing one or more true labels for an object. If an algorithm happens to predict one of these correct, but missing labels, it will be unfairly penalized. Now, if all objects are exhaustively and correctly labeled with all categories, then the problem is trivially solved. But correctly and exhaustively labeling 164k images each with 1000 categories is undesirable: it forces a binary judgement deciding if each category applies to each object; there will be many cases of genuine ambiguity and inter-annotator disagreement. Moreover, the annotation workload will be very large. Given these drawbacks, we describe our solution next.

2.2 Federated Datasets

Our key observation is that the desired evaluation protocol does not require us to exhaustively annotate all images with all categories. What is required instead is that for each category c there must exist two disjoint subsets of the entire dataset D for which the following guarantees hold:

Positive set: there exists a subset of images P_c ⊆ D such that all instances of c in P_c are segmented. In other words, P_c is exhaustively annotated for category c.

Negative set: there exists a subset of images N_c ⊆ D such that no instance of c appears in any of these images.

Given these two subsets for a category c, P_c ∪ N_c can be used to perform standard COCO-style AP evaluation for c. The evaluation oracle only judges the algorithm on a category c over the subset of images in which c has been exhaustively annotated; if a detector reports a detection of category c on an image that is not in P_c ∪ N_c, the detection is not evaluated.

By collecting the per-category sets P_c ∪ N_c into a single dataset, we arrive at the concept of a federated dataset. A federated dataset is a dataset that is formed by the union of smaller constituent datasets, each of which looks exactly like a traditional object detection dataset for a single category. By not annotating all images with all categories, freedom is created to design an annotation process that avoids ambiguous cases and collects annotations only if there is sufficient inter-annotator agreement. At the same time, the workload can be dramatically reduced.

Finally, we note that positive set and negative set membership on the test split is not disclosed and therefore algorithms have no side information about what categories will be evaluated in each image. An algorithm thus must make its best prediction for all categories in each test image.
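To make this concrete, here is a minimal Python sketch (the data structures and names are ours, not the released LVIS format) of a federated dataset and of the per-category evaluation filter implied by the protocol above: a detection of category c is scored only if its image belongs to P_c ∪ N_c.

```python
from dataclasses import dataclass, field

@dataclass
class ConstituentDataset:
    """One per-category 'mini dataset' inside the federated dataset."""
    category: str                                   # e.g. a WordNet synset name
    positives: set = field(default_factory=set)     # P_c: image ids exhaustively annotated for c
    negatives: set = field(default_factory=set)     # N_c: image ids guaranteed to contain no instance of c
    instances: dict = field(default_factory=dict)   # image id -> list of ground-truth masks

def filter_detections(cat: str, detections: list, dataset: ConstituentDataset) -> list:
    """Keep only detections of `cat` on images where `cat` is actually evaluated.

    detections: dicts like {"image_id": ..., "category": ..., "mask": ..., "score": ...}.
    Detections on images outside P_c ∪ N_c are silently ignored, never penalized.
    """
    eval_images = dataset.positives | dataset.negatives
    return [d for d in detections
            if d["category"] == cat and d["image_id"] in eval_images]

# Per-category AP is then computed on the filtered detections against the ground
# truth of the positive set (negatives contribute no instances), and the benchmark
# metric averages AP over all categories.
```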

Figure 4: Our annotation pipeline comprises six stages. Stage 1: Object Spotting elicits annotators to mark a single instance of many different categories per image. This stage is iterative and causes annotators to discover a long tail of categories. Stage 2: Exhaustive Instance Marking extends the stage 1 annotations to cover all instances of each spotted category. Here we show additional instances of book. Stages 3 and 4: Instance Segmentation and Verification are repeated back and forth until 99% of all segmentations pass a quality check. Stage 5: Exhaustive Annotations Verification checks that all instances are in fact segmented and flags categories that are missing one or more instances. Stage 6: Negative Labels are assigned by verifying that a subset of categories do not appear in the image.

Reduced Workload.

Federated dataset design allows us to make |P_c| much smaller than |D|. This choice dramatically reduces the workload and allows us to undersample the most frequent categories in order to avoid wasting annotation resources on them (e.g. person accounts for 30% of COCO). Of our estimated 2 million instances, likely no single category will account for more than 3% of the total instances.

2.3 Evaluation Details

The challenge evaluation server will only return the overall AP, not per-category APs. We do this because: (1) it avoids leaking which categories are present in the test set (it is possible that the categories present in the val and test sets are a strict subset of those in the train set; we use the standard COCO 2017 val and test splits and cannot guarantee that all categories present in the train images are also present in val and test); (2) given that tail categories are rare, there will be few examples for evaluation in some cases, which makes per-category AP unstable; (3) by averaging over a large number of categories, the overall category-averaged AP has lower variance, making it a robust metric for ranking algorithms.

Non-Exhaustive Annotations.

We also collect an image-level boolean label, e_{c,i}, indicating whether image i ∈ P_c is exhaustively annotated for category c. In most cases (91%), this flag is true, indicating that the annotations are indeed exhaustive. In the remaining cases, there is at least one instance in the image that is not annotated. Missing annotations often occur in ‘crowds’ where there are a large number of instances and delineating them is difficult. During evaluation, we do not count false positives for category c on images that have e_{c,i} set to false. We do measure recall on these images: the detector is expected to predict accurate segmentation masks for the labeled instances. Our strategy differs from other datasets that use a small maximum number of instances per image, per category (10-15) together with ‘crowd regions’ (COCO) or use a special ‘group of’ label to represent 5 or more instances (Open Images v4). Our annotation pipeline (§3) attempts to collect segmentations for all instances in an image, regardless of count, and then checks if the labeling is in fact exhaustive. See Fig. 3.
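As an illustration of this rule (our own sketch, not the official evaluation code), a detection of category c on image i can be classified as follows once matching against the ground truth has been done:

```python
def classify_detection(matched_gt, exhaustive_flag):
    """Return 'TP', 'FP', or 'ignore' for one detection of category c on image i.

    matched_gt: the ground-truth instance the detection matched, or None.
    exhaustive_flag: e_{c,i} -- True if image i is exhaustively annotated for c.
    """
    if matched_gt is not None:
        return "TP"          # matched a labeled instance: always counts toward recall
    if not exhaustive_flag:
        return "ignore"      # unlabeled instances may exist: do not penalize
    return "FP"              # exhaustive image, no match: false positive
```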

Hierarchy.

During evaluation, we treat all categories the same; we do nothing special in the case of hierarchical relationships. To perform best, for each detected object o, the detector should output the most specific correct category as well as all more general categories, e.g., a canoe should be labeled both canoe and boat. The detected object o in image i will be evaluated with respect to all labeled positive categories, i.e. all categories c with i ∈ P_c, which may be any subset of categories between the most specific and the most general.
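Since every LVIS category is a WordNet synset, a detector can expand a specific prediction to its more general ancestors with NLTK's WordNet interface; below is a sketch under that assumption (the vocabulary is given as a set of synset names).

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def expand_to_ancestors(synset_name, vocabulary):
    """Return the predicted synset plus all of its hypernym ancestors that are
    also categories in the vocabulary (e.g. canoe -> canoe + boat)."""
    labels = set()
    frontier = [wn.synset(synset_name)]
    while frontier:
        s = frontier.pop()
        if s.name() in vocabulary:
            labels.add(s.name())
        frontier.extend(s.hypernyms())  # WordNet hypernyms form a DAG, so this terminates
    return labels

# Illustrative usage: the hypernym chain of 'canoe.n.01' passes through 'boat.n.01'.
print(expand_to_ancestors("canoe.n.01", {"canoe.n.01", "boat.n.01"}))
```

In practice the detector would emit one scored detection per expanded label; the sketch only shows the label expansion itself.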

Synonyms.

A federated dataset that separates synonyms into different categories is valid, but is unnecessarily fragmented (see Fig. 2, right). We avoid splitting synonyms into separate categories by using WordNet [21]. Specifically, in LVIS each category is a WordNet synset—a word sense specified by a set of synonyms and a definition.

3 Dataset Construction

In this section we provide an overview of the annotation pipeline that we use to collect LVIS.

3.1 Annotation Pipeline

Fig. 4 illustrates our annotation pipeline by showing the output of each stage, which we describe below. For now, assume that we have a fixed category vocabulary V. We will describe how the vocabulary was collected in §3.2.

Object Spotting, Stage 1.

The goals of the object spotting stage are to: (1) generate the positive set, P_c, for each category c and (2) elicit vocabulary recall such that many different object categories are included in the dataset.

Object spotting is an iterative process in which each image is visited a variable number of times. On the first visit, an annotator is asked to mark one object with a point and to name it with a category in V using an autocomplete text input. On each subsequent visit, all previously spotted objects are displayed and an annotator is asked to mark an object of a previously unmarked category or to skip the image if no more categories in V can be spotted. When an image has been skipped 3 times, it will no longer be visited. The autocomplete is performed against the set of all synonyms, presented with their definitions; we internally map the selected word to its synset/category to resolve synonyms.

Obvious and salient objects are spotted early in this iterative process. As an image is visited more, less obvious objects are spotted, including incidental, non-salient ones. We run the spotting stage twice, and for each image we retain categories that were spotted in both runs. Thus two people must independently agree on a name in order for it to be included in the dataset; this increases naming consistency.
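A simplified sketch of this bookkeeping follows (the data structures are ours; the real pipeline runs on a crowdsourcing platform): an image keeps receiving visits until it accumulates three skips, and only categories named in both independent runs are retained.

```python
def spot_categories(images, annotator, max_skips=3):
    """One spotting run. `annotator(image, already_spotted)` stands in for a crowd
    worker: it returns a synset name not yet spotted in the image, or None to skip."""
    spotted = {img: set() for img in images}
    skips = {img: 0 for img in images}
    active = set(images)
    while active:
        for img in list(active):
            cat = annotator(img, spotted[img])
            if cat is None:
                skips[img] += 1
                if skips[img] >= max_skips:
                    active.discard(img)     # skipped 3 times: stop visiting
            else:
                spotted[img].add(cat)
    return spotted

# Two independent runs; per image, keep only the categories both runs agree on.
# run1, run2 = spot_categories(images, annotator_a), spot_categories(images, annotator_b)
# consensus = {img: run1[img] & run2[img] for img in images}
```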

To summarize the output of stage 1: for each category in the vocabulary, we have a (possibly empty) set of images in which one object of that category is marked per image. This defines an initial positive set, P_c, for each category c.

Exhaustive Instance Marking, Stage 2.

The goals of this stage are to: (1) verify the stage 1 annotations and (2) take each image i ∈ P_c and mark all instances of c in i with a point.

In this stage, the (image, category) pairs from stage 1 are each sent to 5 annotators. They are asked to perform two steps. First, they are shown the definition of the category and asked to verify if it describes the spotted object. Second, if it matches, then the annotators are asked to mark all other instances of the same category. If it does not match, there is no second step. To prevent frequent categories from dominating the dataset and to reduce the overall workload, we subsample frequent categories such that no positive set exceeds 1% of the images in the dataset.

To ensure annotation quality, we embed a ‘gold set’ within the pool of work. These are cases for which we know the correct ground-truth. We use the gold set to automatically evaluate the work quality of each annotator so that we can direct work towards more reliable annotators. We use 5 annotators per pair to help ensure instance-level recall.

To summarize, from stage 2 we have exhaustive instance spotting for each image i ∈ P_c for each category c.

Instance Segmentation, Stage 3.

The goals of the instance segmentation stage are to: (1) verify the category for each marked object from stage 2 and (2) upgrade each marked object from a point annotation to a full segmentation mask.

To do this, each pair of image and marked object instance is presented to one annotator, who is asked to verify that the category label for the marked instance is correct and, if it is correct, to draw a detailed segmentation mask for it (e.g. see Fig. 3).

We use a training task to establish our quality standards. Annotator quality is assessed with a gold set and by tracking their average vertex count per polygon. We use these metrics to assign work to reliable annotators.

In sum, from stage 3 we have, for each (image, spotted instance) pair, one segmentation mask (if it is not rejected).

Segment Verification, Stage 4.

The goal of the segment verification stage is to verify the quality of the segmentation masks from stage 3. We show each segmentation to up to 5 annotators and ask them to rate its quality using a rubric. If two or more annotators reject the mask, then we requeue the instance for stage 3 segmentation. Thus we only accept a segmentation if 4 annotators agree it is high-quality. Unreliable workers from stage 3 are not invited to judge segmentations in stage 4; we also use rejection rates from this stage to monitor annotator reliability. We iterate between stages 3 and 4 a total of four times, each time only re-annotating rejected instances.
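Below is a hedged sketch of this segment-and-verify loop with the annotator interactions stubbed out; the 4-of-5 acceptance threshold, the 2-rejection requeue rule, and the four rounds mirror the text above.

```python
def segment_and_verify(instances, segment_fn, rate_fn, rounds=4, raters=5):
    """instances: hashable ids of objects needing masks.
    segment_fn(inst) -> mask (stage 3, stand-in for an annotator drawing a polygon).
    rate_fn(mask) -> bool approval from one annotator (stage 4)."""
    accepted = {}
    pending = list(instances)
    for _ in range(rounds):
        still_rejected = []
        for inst in pending:
            mask = segment_fn(inst)
            approvals = sum(rate_fn(mask) for _ in range(raters))
            if approvals >= raters - 1:      # at least 4 of 5 approve: accept
                accepted[inst] = mask
            else:                            # 2+ rejections: requeue for re-segmentation
                still_rejected.append(inst)
        pending = still_rejected
        if not pending:
            break
    return accepted, pending                 # `pending` holds the small fraction that never passed
```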

To summarize the output of stage 4 (after iterating back and forth with stage 3): we have a high-quality segmentation mask for 99% of all marked objects.

Full Recall Verification, Stage 5.

The full recall verification stage finalizes the positive sets P_c. The goal is to find images i ∈ P_c in which c is not exhaustively annotated. We do this by asking annotators if there are any unsegmented instances of category c in image i. We ask up to 5 annotators and require at least 4 to agree that the annotation is exhaustive. As soon as two believe it is not, we mark the exhaustive annotation flag as false. We use a gold set to maintain quality.

To summarize the output of stage 5: we have a boolean flag e_{c,i} for each image i ∈ P_c indicating if category c is exhaustively annotated in image i. This finalizes the positive sets along with their instance segmentation annotations.

Negative Sets, Stage 6.

The final stage of the pipeline is to collect a negative set N_c for each category c in the vocabulary. We do this by randomly sampling images i ∈ D \ P_c, where D is the set of all images in the dataset. For each sampled image i, we ask up to 5 annotators if category c appears in image i. If any one annotator reports that it does, we reject the image. Otherwise, i is added to N_c. We sample until the negative set reaches a target size of 1% of the images in the dataset. We use a gold set to maintain quality.
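A sketch of this sampling loop is shown below, with the annotator query stubbed out; the 1% target and the reject-on-any-single-report rule follow the description above.

```python
import random

def build_negative_set(cat, all_images, positives, appears_fn,
                       target_frac=0.01, raters=5):
    """appears_fn(cat, image) -> True if one annotator says `cat` appears in `image`."""
    target = int(target_frac * len(all_images))
    candidates = list(set(all_images) - set(positives))   # sample from D \ P_c
    random.shuffle(candidates)
    negatives = set()
    for img in candidates:
        if len(negatives) >= target:
            break
        # Reject the image as soon as any single annotator reports the category.
        if any(appears_fn(cat, img) for _ in range(raters)):
            continue
        negatives.add(img)
    return negatives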

To summarize, from stage 6 we have a negative image set N_c for each category c such that the category does not appear in any of the images in N_c.

3.2 Vocabulary Construction

We construct the vocabulary with an iterative process that starts from a large super-vocabulary and uses the object spotting process (stage 1) to winnow it down. We start from 8.8k synsets that were selected from WordNet by removing some obvious cases (e.g. proper nouns) and then finding the intersection with highly concrete common nouns [2]. This yields a high-recall set of concrete, and thus likely visual, entry-level synsets. We then apply object spotting to 10k COCO images with autocomplete against this super-vocabulary. This yields a reduced vocabulary with which we repeat the process once more. Finally, we perform minor manual editing. The resulting vocabulary contains 1723 synsets—the upper bound on the number of categories that can appear in LVIS.
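As an illustration of the winnowing step, the sketch below intersects WordNet noun synsets with a table of word-concreteness ratings in the spirit of [2]; the file name, column names, and the 4.0 cut-off are our assumptions, not the paper's exact values.

```python
import csv
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def concrete_noun_synsets(ratings_csv="concreteness.csv", min_conc=4.0):
    """Keep noun synsets that have at least one lemma with a high concreteness
    rating. The CSV format (columns 'Word' and 'Conc.M') is assumed."""
    ratings = {}
    with open(ratings_csv) as f:
        for row in csv.DictReader(f):
            ratings[row["Word"].lower()] = float(row["Conc.M"])
    keep = set()
    for syn in wn.all_synsets(pos="n"):
        lemmas = [l.name().lower().replace("_", " ") for l in syn.lemmas()]
        if any(ratings.get(lemma, 0.0) >= min_conc for lemma in lemmas):
            keep.add(syn.name())
    return keep
```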

Figure 5: Distribution of object centers in normalized image coordinates for four datasets. ADE20K exhibits the greatest spatial diversity, with LVIS achieving greater complexity than COCO and the Open Images v4 training set.
(a) Distribution of category count per image. LVIS has a heavier tail than COCO and Open Images training set. ADE20K is the most uniform.
(b) The number of instances per category (on 5k images) reveals the long tail with few examples. Orange dots: categories in common with COCO.
(c) Relative segmentation mask size (square root of mask-area-divided-by-image-area) compared between LVIS, COCO, and ADE20K.
Figure 6: Dataset statistics. Best viewed digitally.
(a) LVIS segmentation quality measured by mask IoU between matched instances from two runs of our annotation pipeline. Masks from the runs are consistent with a dataset average IoU of 0.85.
(b) LVIS recognition quality measured by F1 score given matched instances across two runs of our annotation pipeline. Category labeling is consistent with a dataset average F1 score of 0.87.
(c) Illustration of mask IoU vs. boundary quality to provide intuition for interpreting Fig. 7a (left) and Tab. 1a (dataset annotations vs. expert annotators, below).
Figure 7: Annotation consistency using 5000 doubly annotated images from LVIS. Best viewed digitally.
                                      mask IoU                        boundary quality
dataset   comparison                  mean           median           mean           median
COCO      dataset vs. experts         [0.83, 0.87]   [0.88, 0.91]     [0.77, 0.82]   [0.79, 0.88]
          expert1 vs. expert2         [0.91, 0.95]   [0.96, 0.98]     [0.92, 0.96]   [0.97, 0.99]
ADE20K    dataset vs. experts         [0.84, 0.88]   [0.90, 0.93]     [0.83, 0.87]   [0.84, 0.92]
          expert1 vs. expert2         [0.90, 0.94]   [0.95, 0.97]     [0.90, 0.95]   [0.99, 1.00]
LVIS      dataset vs. experts         [0.90, 0.92]   [0.94, 0.96]     [0.87, 0.91]   [0.93, 0.98]
          expert1 vs. expert2         [0.93, 0.96]   [0.96, 0.98]     [0.91, 0.96]   [0.97, 1.00]

(a) For each metric (mask IoU, boundary quality) and each statistic (mean, median), we show a bootstrapped 95% confidence interval. LVIS has the highest quality across all measures.
                                      boundary complexity
dataset   annotation source           mean           median
COCO      dataset                     [5.59, 6.04]   [5.13, 5.51]
          experts                     [6.94, 7.84]   [5.86, 6.80]
ADE20K    dataset                     [6.00, 6.84]   [4.79, 5.31]
          experts                     [6.34, 7.43]   [4.83, 5.53]
LVIS      dataset                     [6.35, 7.07]   [5.44, 6.00]
          experts                     [7.13, 8.48]   [5.91, 6.82]

(b) Comparison of annotation complexity. Boundary complexity is perimeter divided by square root area [1].
Table 1: Annotation quality and complexity relative to experts.

4 Dataset Analysis

For analysis, we have annotated 5000 images (the COCO val2017 split) twice using the proposed pipeline. We begin by discussing general dataset statistics next before proceeding to an analysis of annotation consistency in §4.2 and an analysis of the evaluation protocol in §4.3.

4.1 Dataset Statistics

Category Statistics.

There are 977 categories present in the 5000 LVIS images. The category growth rate (see Fig. 9) indicates that the final dataset will have well over 1000 categories. On average, each image is annotated with 11.2 instances from 3.4 categories. The largest instances-per-image count is a remarkable 294. Fig. 6a shows the full categories-per-image distribution. LVIS's distribution has more spread than COCO's, indicating that many images are labeled with more categories. The low-shot nature of our dataset can be seen in Fig. 6b, which plots the total number of instances for each category (in the 5000 images). The median value is 9, and while this number will be larger for the full image set, this statistic highlights the challenging long-tailed nature of our data.

Spatial Statistics.

Our object spotting process (stage 1) encourages the inclusion of objects distributed throughout the image plane, not just the most salient foreground objects. The effect can be seen in Fig. 5, which shows object-center density plots. All datasets have some degree of center bias, with ADE20K and LVIS having the most diverse spatial distribution. COCO and Open Images v4 (training set) have similar object-center distributions with a marginally lower degree of spatial diversity. (The CVPR 2019 version of this paper shows the distribution of the Open Images v4 validation set, which has more center bias. The peakiness is also exaggerated due to an intensity scaling artifact. For more details, see https://storage.googleapis.com/openimages/web/factsfigures.html.)

Scale Statistics.

Objects in LVIS are also more likely to be small. Fig. 6c shows the relative size distribution of object masks: compared with COCO, LVIS objects tend to be smaller and there are fewer large objects (e.g., objects that occupy most of an image are 10× less frequent). ADE20K has the fewest large objects overall and more medium ones.

4.2 Annotation Consistency

Annotation Pipeline Repeatability.

A repeatable annotation pipeline implies that the process generating the ground-truth data is not overly random and therefore may be learned. To understand repeatability, we annotated the 5000 images twice: after completing object spotting (stage 1), we have initial positive sets P_c for each category c; we then execute stages 2 through 5 (exhaustive instance marking through full recall verification) twice in order to yield doubly annotated positive sets. To compare them, we compute a matching between them for each image and category pair. We find a matching that maximizes the total mask intersection over union (IoU) summed over the matched pairs and then discard any matches with IoU below 0.5. Given these matches we compute the dataset average mask IoU (0.85) and the dataset average F1 score (0.87). Intuitively, these quantities describe ‘segmentation quality’ and ‘recognition quality’ [12]. The cumulative distributions of these metrics (Fig. 7a and 7b) show that even though matches are established based on a low IoU threshold (0.5), matched masks tend to have much higher IoU. The results show that roughly 50% of matched instances have IoU greater than 90% and roughly 75% of the image-category pairs have a perfect F1 score. Taken together, these metrics are a strong indication that our pipeline has a large degree of repeatability.
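A sketch of this consistency computation for one (image, category) pair is shown below, assuming boolean masks from the two runs: the IoU-maximizing matching is solved as an assignment problem with SciPy, matches below 0.5 IoU are discarded, and the survivors feed the average-IoU and F1-style statistics.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(a, b):
    """IoU of two boolean masks of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def match_runs(masks1, masks2, iou_thresh=0.5):
    """Match instances across two annotation runs for one (image, category) pair,
    maximizing total IoU; return the matched IoUs and an F1-style agreement score."""
    if not masks1 or not masks2:
        return [], 0.0
    iou = np.array([[mask_iou(a, b) for b in masks2] for a in masks1])
    rows, cols = linear_sum_assignment(iou, maximize=True)
    matched = [iou[r, c] for r, c in zip(rows, cols) if iou[r, c] >= iou_thresh]
    tp = len(matched)
    f1 = 2 * tp / (len(masks1) + len(masks2))   # 2TP / (2TP + FP + FN)
    return matched, f1
```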

Comparison with Expert Annotators.

To measure segmentation quality, we randomly selected 100 instances with mask area greater than a minimum pixel threshold from LVIS, COCO, and ADE20K. We presented these instances (indicated by bounding box and category) to two independent expert annotators and asked them to segment each object using professional image editing tools. We compare dataset annotations to expert annotations using mask IoU and boundary quality (boundary F-measure [20]) in Tab. 1a. The results (bootstrapped 95% confidence intervals) show that our masks are high-quality, surpassing COCO and ADE20K on both measures (see Fig. 7c for intuition). At the same time, the objects in LVIS have more complex boundaries [1] (Tab. 1b).
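The boundary measure compares the one-pixel-wide contours of two masks with a small localization tolerance. Below is a minimal sketch of this style of metric (a simplified pixel-tolerance variant, not the exact implementation from [20]).

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary(mask):
    """One-pixel-wide boundary: the mask minus its erosion."""
    return mask & ~binary_erosion(mask)

def boundary_f(pred, gt, tol=2):
    """Boundary precision/recall/F with a `tol`-pixel matching tolerance."""
    pb, gb = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    gb_dilated = binary_dilation(gb, structure=struct)   # tolerance band around gt boundary
    pb_dilated = binary_dilation(pb, structure=struct)   # tolerance band around pred boundary
    precision = (pb & gb_dilated).sum() / max(pb.sum(), 1)
    recall = (gb & pb_dilated).sum() / max(gb.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```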

(a) Given fixed detections, we show how AP varies with the maximum number of negative images per category used in evaluation.
(b) With the same detections as in Fig. 8a and a fixed negative set size, we show how AP varies with the maximum positive set size.
(c) Low-shot detection is an open problem: training Mask R-CNN on 1k images decreases COCO val2017 mask AP from 36% to 10%.
Figure 8: Detection experiments using COCO and 5000 annotated images from LVIS. Best viewed digitally.
Mask R-CNN        model id    test anno.   box AP   mask AP
R-50-FPN          35859007    COCO         38.2     34.1
                              LVIS         38.8     34.4
R-101-FPN         35861858    COCO         40.6     36.0
                              LVIS         40.9     36.0
X-101-64x4d-FPN   37129812    COCO         47.8     41.2
                              LVIS         48.6     41.7

Table 2: COCO-trained Mask R-CNN evaluated on LVIS annotations. Both annotations yield similar AP values.

4.3 Evaluation Protocol

COCO Detectors on LVIS.

To validate our annotations and federated dataset design we downloaded three Mask R-CNN [9] models from the Detectron Model Zoo [7] and evaluated them on LVIS annotations for the categories in COCO. Tab. 2 shows that both box AP and mask AP are close between our annotations and the original ones from COCO for all models, which span a wide AP range. This result validates our annotations and evaluation protocol: even though LVIS uses a federated dataset design with sparse annotations, the quantitative outcome closely reproduces the ‘gold standard’ results from dense COCO annotations.

Federated Dataset Simulations.

For insight into how AP changes with the positive and negative set sizes |P_c| and |N_c|, we randomly sample smaller evaluation sets from COCO val2017 and recompute AP. To plot quartiles and min-max ranges, we re-test each setting 20 times. In Fig. 8a we use all positive instances for evaluation, but vary the number of negative images per category between 50 and 5k. AP decreases somewhat (by about 2 points) as we increase the number of negative images, because the ratio of negative to positive examples grows with a fixed positive set and an increasing number of negatives. Next, in Fig. 8b we fix the negative set size and vary the positive set size. We observe that even with a small positive set size of 80, AP is similar to the baseline with low variance. With smaller positive sets (down to 5), variance increases, but the AP gap from 1st to 3rd quartile remains below 2 points. These simulations, together with COCO detectors tested on LVIS (Tab. 2), indicate that including smaller evaluation sets for each category is viable for evaluation.

Low-Shot Detection.

To validate the claim that low-shot detection is a challenging open problem, we trained Mask R-CNN on random subsets of COCO train2017 ranging from 1k to 118k images. For each subset, we optimized the learning rate schedule and weight decay by grid search. Results on val2017 are shown in Fig. 8c. At 1k images, mask AP drops from 36.4% (full dataset) to 9.8% (1k subset). In the 1k subset, 89% of the categories have more than 20 training instances, while the low-shot literature typically considers 20 or fewer examples per category [8].

Figure 9: (Left) As more images are annotated, new categories are discovered. (Right) Consequently, the percentage of low-shot categories (blue curve) remains large, decreasing slowly.

Low-Shot Category Statistics.

Fig. 9 (left) shows category growth as a function of image count (up to 977 categories in 5k images). Extrapolating the trajectory, our final dataset will include over 1k categories (upper bounded by the vocabulary size, 1723). Since the number of categories increases during data collection, the low-shot nature of LVIS is somewhat independent of the dataset scale; see Fig. 9 (right), where we bin categories based on how many images they appear in: rare (1-10 images), common (11-100), and frequent (>100). These bins, as measured w.r.t. the training set, will be used to present disaggregated AP metrics.
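A small sketch of this binning, given the number of training images each category appears in (the bin names and thresholds come from the text above):

```python
def frequency_bin(num_train_images):
    """LVIS-style frequency bins: rare (1-10 images), common (11-100), frequent (>100)."""
    if num_train_images <= 10:
        return "rare"
    if num_train_images <= 100:
        return "common"
    return "frequent"

# Example: group categories by bin to report AP disaggregated by frequency.
# bins = {}
# for cat, n in images_per_category.items():
#     bins.setdefault(frequency_bin(n), set()).add(cat)
```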

5 Conclusion

We introduced LVIS, a new dataset designed to enable, for the first time, the rigorous study of instance segmentation algorithms that can recognize a large vocabulary of object categories (more than 1000) and must do so using methods that can cope with the open problem of low-shot learning. While LVIS emphasizes learning from few examples, the dataset is not small: it will span 164k images and label 2 million object instances. Each object instance is segmented with a high-quality mask that surpasses the annotation quality of related datasets. We plan to establish LVIS as a benchmark challenge that we hope will lead to exciting new object detection, segmentation, and low-shot learning algorithms.

Appendix A: LVIS Release v0.5

LVIS release v0.5 marks the halfway point in data collection. For this release, we have annotated an additional 77k images (57k train, 20k test) beyond the 5k val images that we analyzed in the previous sections, for a total of 82k annotated images. Release v0.5 is publicly available at http://www.lvisdataset.org and will be used in the first LVIS Challenge to be held in conjunction with the COCO Workshop at ICCV 2019.

Collection Details.

We collected this data in two 38.5k-image batches using the process described in the main paper. Each batch contained a proportional mixture of train and test set images. After all stages were completed for the first batch, we (the authors of this paper) manually checked all 1415 categories that were represented in the raw data collection and cast an include vs. exclude vote for each category based on its visual consistency. This process led to the removal of 18% of the categories and 10% of the labeled instances. After collecting the second batch, we repeated this process for the 83 categories that were newly introduced. After we finish the full data collection for v1 (estimated January 2020), we will conduct another similar quality control pass on a subset of the categories.

LVIS val v0.5 is the same as the set used for analysis in the main paper, except that we: (1) removed categories that were determined to be visually inconsistent in the quality control pass and (2) removed any categories with zero instances in the training set. In this section, we refer to the set of images used for analysis in the main paper as ‘LVIS val (unpruned)’.

Statistics.

After our quality control pass, the final category count for release v0.5 is 1230. The number of categories in the val set decreased from 977 to 830, due to quality control, and now has 56k segmented object instances. The train v0.5 set has 694k segmented instances.

We now repeat some of the key analysis plots, this time showing the final val and train v0.5 sets in comparison to the original (unpruned) val set that was analyzed in the main paper. The train and test sets are collected using an identical process (the images are mixed together in each annotation batch) and therefore the training data is statistically identical to that of the test data (noting that the train and test images were randomly sampled from the same image distribution when COCO was collected).

Fig. 10a illustrates the category growth rate on the train set and the val set before and after pruning. We expect only modest growth while collecting the second half of the dataset, perhaps expanding by roughly 100 additional categories. Next, we extend Fig. 9 (right) from 5k images to 57k images using the train v0.5 data, as shown in Fig. 10b. Due to the slowing category growth, the percent of rare categories (those appearing in 1-10 training images) is decreasing, but remains a sizeable portion of the dataset. Roughly 75% of categories appear in 100 training images or fewer, highlighting the challenging low-shot nature of the dataset.

Finally, we look at the spatial distribution of object centers in Fig. 11. This visualization verifies that quality control did not lead to a meaningful bias in this statistic. The train and val sets exhibit visually similar distributions.

(a) Category growth comparison between the train and val sets (before and after quality control).
(b) As category growth slows, the percent of rare categories decreases, but remains large (train v0.5 set).
Figure 10: Category growth and frequency statistics for LVIS v0.5. Best viewed digitally. Compare with Fig. 9.
Figure 11: Distribution of object centers in normalized image coordinates for LVIS val (prior to quality control; i.e. the same as in Fig. 5), LVIS val v0.5 (after quality control), and LVIS train v0.5. The distributions are nearly identical.

Summary.

Based on this analysis and our qualitative judgement when performing per-category quality control, we conclude that our data collection process scales well beyond the initial 5k set analyzed in the main paper.

Acknowledgements.

We thank Ilija Radosavovic, Amanpreet Singh, Alexander Kirillov, and Tsung-Yi Lin for their help during the creation of LVIS. We would also like to thank the COCO Committee for granting us permission to annotate the COCO test set. We are grateful to Amanpreet Singh for his help in creating the LVIS website.

Figure 12: Example annotations from our dataset. For clarity, we show one category per image. (NE) signifies that the category was not exhaustively annotated in the image. See http://www.lvisdataset.org/explore to explore LVIS in detail.

References

  • [1] F. Attneave and M. D. Arnoult (1956) The quantitative study of shape and pattern perception.. Psychological bulletin. Cited by: (b)b, §4.2.
  • [2] M. Brysbaert, A. B. Warriner, and V. Kuperman (2014) Concreteness ratings for 40 thousand generally known english word lemmas. Behavior research methods. Cited by: §3.2.
  • [3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §1.1.
  • [4] P. Dollár, C. Wojek, B. Schiele, and P. Perona (2012) Pedestrian detection: an evaluation of the state of the art. TPAMI. Cited by: §1.1.
  • [5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The PASCAL Visual Object Classes (VOC) Challenge. IJCV. Cited by: §1.1.
  • [6] L. Fei-Fei, R. Fergus, and P. Perona (2006) One-shot learning of object categories. TPAMI. Cited by: §1.1.
  • [7] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He (2018) Detectron. Note: https://github.com/facebookresearch/detectron Cited by: §4.3.
  • [8] B. Hariharan and R. Girshick (2017) Low-shot visual recognition by shrinking and hallucinating features. In ICCV, Cited by: §4.3.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In ICCV, Cited by: §4.3.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1.1.
  • [11] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §1.1.
  • [12] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár (2019) Panoptic segmentation. In CVPR, Cited by: §4.2.
  • [13] A. Krizhevsky, I. Sutskever, and G. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: §1.1.
  • [14] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, et al. (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982. Cited by: §1.1.
  • [15] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989) Backpropagation applied to handwritten zip code recognition. Neural computation. Cited by: §1.1.
  • [16] Y. LeCun, C. Cortes, and C. J.C. Burges (1998) The MNIST database of handwritten digits. Note: http://yann.lecun.com/exdb/mnist/ Cited by: §1.1.
  • [17] M. Liberman (2015) Reproducible research and the common task method. Note: Simmons Foundation Lecture https://www.simonsfoundation.org/lecture/reproducible-research-and-the-common-task-method/ Cited by: §1.1.
  • [18] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §1.1, §1.1, §1.
  • [19] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (Accessed Oct 30, 2018) COCO detection evaluation. Note: http://cocodataset.org/#detection-eval Cited by: §1, §2.1.
  • [20] D. Martin, C. Fowlkes, D. Tal, and J. Malik (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, Cited by: §1.1, §4.2.
  • [21] G. Miller (1998) WordNet: an electronic lexical database. MIT press. Cited by: §2.3.
  • [22] G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder (2017) The mapillary vistas dataset for semantic understanding of street scenes.. In ICCV, Cited by: §1.1.
  • [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV. Cited by: §1.1.
  • [24] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman (2008) LabelMe: a database and web-based tool for image annotation. IJCV. Cited by: §1.
  • [25] M. Spain and P. Perona (2007) Measuring and predicting importance of objects in our visual world. Technical report Technical Report CNS-TR-2007-002, California Institute of Technology. Cited by: §1.
  • [26] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie (2018) The iNaturalist species classification and detection dataset. In CVPR, Cited by: §1.1.
  • [27] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) SUN database: large-scale scene recognition from abbey to zoo. In CVPR, Cited by: §1.
  • [28] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2019) Semantic understanding of scenes through the ADE20K dataset. IJCV. Cited by: §1.1, §1.
  • [29] G. K. Zipf (2013) The psycho-biology of language: an introduction to dynamic philology. Routledge. Cited by: §1.