Web-Scale Generic Object Detection at Microsoft Bing

07/05/2021 ∙ by Stephen Xi Chen, et al. ∙ Microsoft

In this paper, we present Generic Object Detection (GenOD), one of the largest object detection systems deployed to a web-scale general visual search engine, detecting over 900 categories for all Microsoft Bing Visual Search queries in near real-time. It acts as a fundamental visual query understanding service that provides object-centric information and shows gains in multiple production scenarios, improving upon domain-specific models. We discuss the challenges of collecting data, training, deploying and updating such a large-scale object detection model with multiple dependencies. We describe a data collection pipeline that reduces per-bounding box labeling cost by 81.5% and latency by 61.2%, and show that GenOD can improve weighted average precision by over 20% compared to domain-specific models. We also improve model update agility by nearly 2 times with the proposed disjoint detector training compared to joint fine-tuning. Finally we demonstrate how GenOD benefits visual search applications by significantly improving object-level search relevance by 54.9% and user engagement by 59.9%.




1. Introduction

Figure 1. Overview of Generic Object Detection (GenOD) in Microsoft Bing Visual Search stack. GenOD is a fundamental image understanding service that provides object level information for a given user query image to trigger multiple scenario services, improve search relevance, and provide user interface hotspots to multiple experience endpoints.

Visual search, in which an image constitutes the user query, is an emerging modality of search that allows users to pose a new class of queries beyond text. It lets users identify visual concepts, retrieve visually and semantically similar images, search for product information, or get inspiration from other images. Understanding and representing the query image is the critical first step of visual search. Many commercial visual search systems represent query images with image-level embeddings. However, this assumes that the query image is focused on a single object with a clean and simple background, which often does not hold true in real-world scenarios with mobile-captured images.

Figure 2. Examples of user interfaces with interactive hotspots detected by the Generic Object Detection (GenOD) in the Bing Visual Search experiences. Left: The experience in Bing Image Details Page on desktop, allows users to click on hotspots to search for related products. Right: The Bing Mobile Camera experience detects objects in real-time to allow the user to quickly choose which object is of interest to them.

Object detection has been introduced to several visual search engines (Jing et al., 2015; Hu et al., 2018; Bell et al., 2020; Zhang et al., 2018) to better parse user intent. Given an image, object detection aims to locate and recognize objects from a predefined set of categories. Given their business scenarios, these systems tend to use object detection to display hotspots or remove background noise of objects in scoped segments like shopping. For a web-scale, general-purpose visual search engine like Microsoft Bing, there are numerous search query segments and application scenarios and it is imperative to have a comprehensive and scalable object-based understanding of images at a generic level.

In this paper, we present how we built Generic Object Detection (GenOD) as a platform service at Microsoft Bing. Figure 1 depicts the overview of GenOD in the Bing Visual Search stack. Starting from domain-specific detection models, object detection in Bing has evolved to the generic object level with a taxonomy of over 900 categories, making it one of the largest deployed object detection models in production. With the ability to detect a wide range of objects, GenOD fuels multiple applications including visual product search, image content-based triggering and post-processing, fine-grained entity recognition, fine-grained attribute classification, and real-time object detection in camera search experiences (https://bing.com/camera). Figure 2 showcases how users interact with detected object hotspots in the Bing Image Details Page and the Bing Mobile camera-based search experience. The challenges of building such a versatile system can be broken down into three main aspects:

Data collection and processing for a large vocabulary Collecting object detection annotations for hundreds of categories at the scale required for deep models is much slower and more costly than getting image class labels (Su et al., 2012) and can be prohibitively expensive when using expert judges. The task is also fairly complex for crowd platforms, especially because it quickly becomes infeasible for judges to keep track of hundreds of categories. Even if one can leverage recently released large-scale object detection datasets such as OpenImages (Kuznetsova et al., 2020), LVIS (Gupta et al., 2019) and VisualGenome (Krishna et al., 2017) with hundreds to thousands of categories, determining the best way to combine these resources remains an open issue. Compared to conventional object detection models trained on a single dataset with a small vocabulary (Lin et al., 2014; Everingham et al., 2010), training a unified large-scale detection model by combining several diverse datasets faces new challenges including: (1) long-tailed category distribution: this is especially the case in natural images when the number of categories grows 10 times larger; the rare classes often perform poorly with few training samples. (2) hierarchical labels: as the taxonomy grows, each object instance naturally has multiple valid labels as part of a hierarchy. For example, an apple can be labeled as "Apple" and "Fruit" because both categories are in the taxonomy. This introduces missing and noisy label issues because not all object instances can be exhaustively annotated either by humans or oracle models, posing a serious barrier in both model training and evaluation. (3) imbalance between datasets: some datasets are much larger than others, each with its own distribution, and these would likely dominate model training and hurt generalization.

Agility of model development

Continuously iterating machine learning models deployed in online systems remains difficult due to: (1) heavy model training: conventional training updates the entire model in an all-or-nothing change, reducing update agility. (2) production non-regression requirements: when a new model is deployed to production, it is important not to regress in performance for any downstream task dependent on the model. However, with the increasing number of categories and dependencies, improving the model for a certain task or subset of categories may degrade the performance of others, which would block model deployment. Therefore it is imperative to have a novel architecture design to meet such strict requirements.

Latency-accuracy tradeoff The visual search stack in Microsoft Bing has strict near real-time inference requirements, especially for applications like Bing mobile camera. Since GenOD is required to run for all requests, latency of the model is a key criterion in model training and selection.

The key contribution of this paper is a detailed description of how we overcome the challenges mentioned above to design and deploy a large-scale generic object detection system in an industry setting that is adaptable to rapidly changing business needs. Specifically, our contributions are as follows:

  • We discuss the design of a low-cost, high-throughput data collection pipeline that can easily scale to new categories.

  • We discuss how we handle the imbalance in category and dataset distributions while combining multiple datasets to train a large-scale unified generic object detection model. We evaluate on various academic and internal benchmarks to demonstrate the efficacy of the model with good speed-accuracy trade-offs, and show that a generic large-scale model is able to beat domain-specific models.

  • We propose an architecture design of disjoint detectors on a shared backbone pretrained for general purpose object detection, in order to tackle the challenge of agile model updates in a production context.

  • We describe how we serve GenOD at web scale with low latency, and demonstrate its impact as a fundamental service in Bing Visual Search to improve user engagement and relevance for a wide range of deployed applications in Microsoft through offline and online A/B tests.

The rest of the paper is organized as follows: We first review related literature in Section 2 then introduce our data collection pipeline in Section 3. Our model design, training and deployment is described in Section 4 and we include corresponding experiments in Section 5. Finally we demonstrate the applications of GenOD in Bing Visual Search in Section 6.

2. Related Work

2.1. Object detection in visual search

Major companies (Google, 2017; Krishnan, 2019; Jing et al., 2015; Bell et al., 2020; Yang et al., 2017; Hu et al., 2018; Zhai et al., 2017; Shiau et al., 2020; Zhang et al., 2018) have been developing visual search systems to satisfy an increasing demand for content-based retrieval. Facebook (Bell et al., 2020) and Alibaba (Zhang et al., 2018) perform class-agnostic object localization to remove background noise and retrieve images at object level to improve product search relevance. Pinterest (Jing et al., 2015; Zhai et al., 2017; Shiau et al., 2020) displays hotspots on objects in a few shopping segments including fashion, home decor and vehicles.  (Hu et al., 2018) leveraged object detection to improve engagement and relevance in the web-scale responsive visual search system in Microsoft Bing. However, most of these systems target a limited set of shopping domains and only cover a small set of categories. Google Lens (Google, 2017) was one of the first attempts to apply generic object detection for visual search, but a detailed analysis of their system has not been published yet. To the best of our knowledge, this paper is the first work to comprehensively discuss the challenges and solutions for developing a web-scale generic object detection service in a production system.

2.2. Large scale generic object detection

With the advance of deep neural networks (DNNs), the research community is moving towards the challenging goal of building a generic object detection system that can detect a broad or even open-ended range of objects, as humans do (Liu et al., 2020). Numerous object detection architectures have emerged during the last decade. Two-stage detectors (Ren et al., 2015; He et al., 2017) were first proposed to apply DNNs end-to-end to a region proposal network and a detection stage; one-stage detectors (Redmon and Farhadi, 2018; Lin et al., 2017b; Tan et al., 2020) and anchor-free approaches (Law and Deng, 2018; Tian et al., 2019) were proposed later, predicting objects without region proposals and anchor boxes, respectively, to further improve speed-accuracy trade-offs (Huang et al., 2017). For a more comprehensive survey of the area, please refer to (Liu et al., 2020). However, most of these architectures in academic settings seldom consider the agility to add or update categories without regressing others, making them less adaptive in an industry product setting with rapidly changing business needs.

Prevalent works on general-purpose object detection are mostly performed on a predefined small set of categories with relatively adequate and balanced training samples (e.g., 100 to 1000+) for each category (Lin et al., 2014; Everingham et al., 2010). Generic object detection with a large vocabulary, in contrast, poses new challenges including long-tail distribution for data collection and model training. Some large-scale datasets (Kuznetsova et al., 2020; Gupta et al., 2019; Shao et al., 2019) have been collected to facilitate further research in this scenario, in which challenges and solutions to data collection have been discussed. Recent studies addressing the challenges of long-tail distribution include data distribution re-balancing (Mahajan et al., 2018; Gao et al., 2018), class-balanced losses (Lin et al., 2017b), and decoupling representation and classifier learning (Kang et al., 2019; Li et al., 2020). This paper mainly experiments with the data distribution re-balancing approach as a simple but robust baseline, but other directions of research in long-tail object detection could be applied in the future.

3. Data collection

In this section, we describe the methodology used to collect data at scale to power the GenOD service. Given the large nature of the vocabulary, it is imperative from a production standpoint to have a robust pipeline that is high quality, cost-efficient, high throughput and agile to taxonomy changes and business needs. Previous iterations (Hu et al., 2018), which relied on ad-hoc collection of data through third-party vendors or managed human judges, were slow and expensive. We also found that a unified large vocabulary necessitated careful data collection design, since it was infeasible for human judges to label images while keeping hundreds of categories in mind. We leveraged crowd platforms to access a large pool of judges for high-throughput and cost-efficient labeling. Since crowd platforms are generally not suited for complex annotation tasks, we adapted the orchestration in (Gupta et al., 2019) for object detection. An overview of the pipeline can be seen in Figure 3 (image from https://cutetropolis.com/2016/08/31/links-thats-the-way-they-became-the-brady-bunch by Mike Brailer, CC BY-SA 4.0). The key stages in the pipeline are described below:

Figure 3. The GenOD data collection pipeline is designed as a chain of micro-tasks suited for judging on crowd platforms. It has 3 main stages: category discovery, instance marking and bounding box drawing. Micro-tasks with complex annotations which cannot be easily aggregated are followed by verification micro-tasks with high overlap to ensure quality. The pipeline is orchestrated so crowd judges only have to annotate a single category or marker at a time.

Category Discovery

The goal of this stage is to discover all the salient object categories in the image. This is challenging given that there are hundreds of categories and it may be exhausting to label every single object instance in an image. To solve this issue, we ask judges to only discover a single new category by placing a marker on an instance or skip if there are no salient object categories to be added (previously marked categories are shown to the judge). This is repeated until 3 consecutive judges skip, at which point we consider all salient object categories have been discovered. We also employ careful user-interface design so the judge can navigate a hierarchy of categories or directly start typing the name of the category to search for the appropriate category. Unlike (Gupta et al., 2019), the user interface replaces the cursor with a circle with a size corresponding to the minimum area in an image for salient objects. This ensures judges are not spending time marking insignificant objects that are not important from a user scenario standpoint. We also have a simpler variant of this application where a judge only has to spot a specified category rather than provide the names of new categories. The simpler variant is used when we want to quickly collect data for a single new category for business needs. With these two variants, we are able to quickly discover concepts for our vocabulary while also being agile about adding annotations if the taxonomy expands. After category discovery, we run a marker verification micro-task to ensure that all the marked categories are correct.
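The stopping rule above (discovery ends after three consecutive judges skip) can be sketched in a few lines. This is an illustrative simulation, not the production orchestration; `discover_categories` and its input format are hypothetical.

```python
def discover_categories(judgments):
    """Simulate the category-discovery stop rule: each judge either adds one
    new category (a string) or skips (None); discovery ends after 3
    consecutive skips."""
    discovered, consecutive_skips = [], 0
    for j in judgments:
        if j is None:
            consecutive_skips += 1
            if consecutive_skips == 3:
                break  # all salient categories are considered discovered
        else:
            consecutive_skips = 0
            if j not in discovered:
                discovered.append(j)
    return discovered

# Two judges add categories, one skips, another adds, then 3 skips end the task.
print(discover_categories(["shoe", "handbag", None, "dress", None, None, None, "hat"]))
# → ['shoe', 'handbag', 'dress']
```

Note that "hat" is never reached: once three consecutive skips occur, later judgments are not solicited.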

Instance marking

The goal of this stage is to mark all the instances of the categories discovered in the previous stage. We ask a judge to mark every instance of a given category and follow it up with two quality control micro-tasks: (1) Verify that all instances have been marked (2) Verify all the markers are correct. At the end of this stage, we have markers for all the salient object instances in the given set of images.

Bounding box drawing

The goal of this stage is to draw a bounding box for a given category marker. This is followed up by a bounding box verification micro-task to ensure quality. By decoupling the drawing of the bounding box from the marker, the data collection pipeline is flexible to accommodate future needs such as segmentation.

Negative set selection

The goal of this stage is to collect a set of images for a category such that no instance of that category exist in the images. This stage is not necessary while collecting training data, but is useful for the federated measurement design described in  (Gupta et al., 2019).

3.1. Annotation evaluation

We evaluate the proposed pipeline against the baseline data collection approach which used managed vendor judges. To capture the statistics of camera and web-style images appropriately, we randomly sampled 500 images from each distribution for a total evaluation set of 1k images. When comparing the proposed pipeline’s results to the existing baseline annotations, we find that 85.75% of baseline instances (93.5% if we exclude objects with smaller dimension < 55 pixels) are correctly localized, and 97% of the markers are verified as correctly categorized. For a more rigorous comparison that is not biased to the baseline or any particular vocabulary as groundtruth, we sent a subsample of 100 images to expert judges to annotate all salient objects as groundtruth and also verify correctness of bounding boxes provided by each data pipeline. We measure precision for each candidate pipeline’s provided bounding boxes and recall against the expert-provided salient bounding boxes. We can see the metrics for quality, cost and latency in Table 1.
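Box-level precision and recall against expert annotations can be computed by matching candidate boxes to expert boxes at an IoU threshold. The paper does not specify its exact matching protocol; the greedy one-to-one matching at IoU ≥ 0.5 below is one plausible sketch.

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def box_precision_recall(pred, expert, thr=0.5):
    """Greedy one-to-one matching of candidate boxes to expert boxes."""
    matched, tp = set(), 0
    for p in pred:
        best, best_j = 0.0, None
        for j, g in enumerate(expert):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best:
                best, best_j = v, j
        if best_j is not None and best >= thr:
            matched.add(best_j)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(expert) if expert else 0.0
    return precision, recall

# One of two candidate boxes overlaps the single expert box at IoU 0.81.
print(box_precision_recall([(0, 0, 10, 10), (20, 20, 30, 30)], [(1, 1, 10, 10)]))
# → (0.5, 1.0)
```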

While the throughput at the image level is slightly worse than our baseline approach, this is mainly because our pipeline is more successful at finding more instances to be labeled per image. We ran 2 labeling tasks with 100 and 1000 samples respectively and found that the time taken to get a fully annotated image decreased from 9.3 mins to 4.85 mins. As demonstrated in  (Lasecki et al., 2014), task interruption and context switching decreases the efficiency of workers while performing micro-tasks on a crowdsourcing platform. Judges are more likely to work on a task when a lot of data is available to be judged. This suggests that even at the image level, throughput can be increased further by sending larger batches which optimize for the capacity of the crowd platform.

| Pipeline | #Categories/img | #Bboxes/img | Precision | Recall | $/img | $/bbox | Time (mins)/img | Time (mins)/bbox |
| Baseline | 2.3 | 3.18 | 0.959 | 0.602 | 1.89 | 0.65 | 4.30 | 0.67 |
| GenOD data pipeline | 7.4 | 14.41 | 0.933 | 0.859 | 1.63 | 0.12 | 4.85 | 0.26 |

Table 1. Comparison of the GenOD data collection pipeline to the previous method. We find that the new pipeline can discover more object instances, and annotate faster and more cost-effectively per bounding box compared to the baseline.

4. Approach

In this section, we describe how we developed the GenOD service, including data processing to mitigate the serious imbalance in category distribution and dataset sizes, model architecture selection in pursuit of a good speed-accuracy trade-off, and training protocols to achieve a balance between high-performing deep models and model update agility. Our strategy is to first train a single unified base model with a large amount of data to get a generic backbone, with a default detector head for all categories that is easy to maintain and is updated less frequently. We improve on this design with the concept of disjoint detectors on the shared backbone, which allows for agile, targeted updates while not disrupting downstream dependencies. Finally we discuss how we deal with the system challenges of scalable serving with low latency.

4.1. Training Data

We combine several large-scale object detection datasets, including Bing internal datasets and open-source datasets, in our training data. These datasets vary greatly in domain (e.g., fine-grained commerce versus generic concepts), the number of images, the number and hierarchy of categories, as well as the coverage and density of objects per image. Therefore combining these heterogeneous datasets for training with a unified taxonomy is a non-trivial task. Another challenge with such a large vocabulary is the long-tailed, extremely imbalanced category distribution, as shown in the red curve in Figure 4. Directly training on such imbalanced data would lead to poor performance on the tail categories. There is also an imbalance in the number of images from different datasets, ranging from a few hundred thousand to several million. Training would therefore be dominated by the distribution of the larger datasets while smaller datasets would be under-represented.

To combine these diverse and imbalanced datasets, we first built a unified GenOD taxonomy with both top-down human design and bottom-up mapping of the categories from all the datasets, organized in a hierarchical manner. To alleviate the poor performance of rare categories with few training samples, we employ a simple yet effective approach of offline class-aware upsampling (Gao et al., 2018) on the tail classes, in which every category is upsampled to at least a minimum number of instances in the training set, with the threshold chosen empirically. Figure 4 shows the class-wise distribution in our training set before and after class-aware sampling. With class-aware sampling, we obtained a training set of 3.4 million images, which we denote as D-3.4M.

Figure 4.

Dataset label distribution before and after applying class-aware sampling. We can see the originally skewed distribution becomes more balanced. The class IDs are sorted by the number of annotated bounding boxes in the original distribution.

To address the imbalance between different datasets, we also downsample the larger datasets offline. This gives us a training set of 1.4 million images, which we denote as D-1.4M.
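The offline class-aware upsampling step can be sketched as follows. This is a simplified illustration in the spirit of (Gao et al., 2018), not the production implementation; the function name and input format are hypothetical.

```python
import random
from collections import Counter

def class_aware_upsample(annotations, min_count, seed=0):
    """Offline class-aware upsampling: duplicate images containing a rare
    class until that class has at least `min_count` annotated instances.
    `annotations` maps image_id -> list of instance labels."""
    rng = random.Random(seed)
    images = list(annotations)
    counts = Counter(label for labels in annotations.values() for label in labels)
    sampled = list(images)  # every image is kept at least once
    for cls, n in counts.items():
        pool = [im for im in images if cls in annotations[im]]
        while n < min_count:
            im = rng.choice(pool)  # duplicate a random image containing this class
            sampled.append(im)
            n += sum(1 for label in annotations[im] if label == cls)
    return sampled
```

For example, with `{"a": ["cat", "cat"], "b": ["dog"]}` and `min_count=3`, image "a" is duplicated once and "b" twice, so every class reaches at least three instances in the sampled list.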

4.2. Base model architectures

We experimented with model architectures including Faster R-CNN (Ren et al., 2015) and SSD (Liu et al., 2016) for shopping-segment object detection models in our previous work (Hu et al., 2018). However, with an order of magnitude more categories, in this work we consider the speed-accuracy trade-off as our first priority and focus on the evaluation of single-stage detectors, which have demonstrated better speed-accuracy trade-offs since their inception (Redmon et al., 2016; Liu et al., 2016). We evaluate two variants of single-stage detection models: RetinaNet (Lin et al., 2017b) and FCOS (Tian et al., 2019), the state-of-the-art anchor-based and anchor-free single-stage detectors, respectively, at the time of development of GenOD. Both models achieve a good speed-accuracy trade-off on relatively small tasks like COCO. As the number of categories increases, however, latency for RetinaNet increases dramatically, since it has a large number of per-anchor operations and the last few convolution layers that output the class-wise confidences and bounding box regressions become proportionally larger. On the other hand, since FCOS is anchor-free, it greatly reduces the number of per-anchor operations compared to RetinaNet. With a few nearly cost-free improvements, FCOS can achieve better results than previous anchor-based detectors with fewer floating point operations (FLOPs) and lower latency. The experiments in Section 5 provide a comprehensive, quantitative comparison and analysis of these two models.

4.3. Disjoint detector architecture

In a production setting, it is common to have an urgent business need to support a new category or improve a specific category quickly while also not degrading performance on other categories that may have downstream dependencies. With smaller vocabularies, it can be sufficient to retrain the entire model with a new category or more data, but when scaling to a large vocabulary, it becomes very time-consuming to update the entire model and also guarantee no regression in any of the categories. The base model described in the previous section cannot easily accommodate ad-hoc requests or agile updates.

To address this, we designed the GenOD model as a federation of disjoint detector heads that share a fixed common backbone. New detector heads, which include classification and bounding box regression networks, can be trained and added on top of the backbone without disrupting the other detectors. When there is a need to quickly add or update a category, the data collection process described in Section 3 allows us to quickly collect data for that category and then the disjoint principle allows us to update GenOD service with much less data and without disrupting any production dependencies. We explore this through a prototypical experiment in Section 5.3.
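The disjoint-detector principle can be illustrated with a toy registry of heads over a frozen shared backbone. This is a conceptual sketch only, not the production architecture or API; all names here are illustrative, and the "backbone" and "heads" are stand-in functions.

```python
class GenODModel:
    """Toy illustration of the disjoint-detector design: a frozen shared
    backbone with independently trained detector heads that can be added
    without touching the others."""
    def __init__(self, backbone):
        self.backbone = backbone  # frozen feature extractor, shared by all heads
        self.heads = {}           # name -> detector head

    def add_head(self, name, head):
        # Registering a new head never modifies existing heads, so downstream
        # consumers of the other heads cannot regress.
        self.heads[name] = head

    def detect(self, image):
        features = self.backbone(image)  # computed once, shared by all heads
        return {name: head(features) for name, head in self.heads.items()}

# Toy usage: the backbone and heads are plain functions standing in for networks.
model = GenODModel(backbone=lambda img: sum(img))
model.add_head("default", lambda f: f"default:{f}")
model.add_head("sofa", lambda f: f"sofa:{f}")  # agile, targeted update
print(model.detect([1, 2, 3]))
# → {'default': 'default:6', 'sofa': 'sofa:6'}
```

The key property is that adding the "sofa" head changes nothing about the "default" head's outputs, mirroring the non-regression requirement described above.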

4.4. Deployment and Serving

Service latency is an important factor for a core service like GenOD, so we deploy the GenOD models to Microsoft Azure GPU clusters. To serve the GenOD models on GPUs, we first convert them to ONNX models and use ONNX Runtime (Microsoft, ) as the backend for inference, which provides a substantial speed-up. We built a wrapper of ONNX Runtime on Windows and used the Domain Specific Language (DSL) in (Hu et al., 2018), which utilizes a graph execution engine to perform thread-safe inference. To address global scalability issues, we leverage a micro-service system built on Microsoft Service Fabric (Kakivaya et al., 2018) to serve multiple GPU models as micro-services on several global clusters that can scale elastically based on traffic in different regions. Building a cache of detected objects further reduces end-to-end latency. In the end we have built an elastic, scalable GPU serving system for GenOD which can handle hundreds of queries per second across different regions.
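One simple way such a cache of detected objects could look is an LRU cache keyed by a hash of the query image bytes, so repeated queries skip GPU inference. This is an illustrative sketch, not Bing's implementation; `DetectionCache` and its parameters are hypothetical.

```python
import hashlib
from collections import OrderedDict

class DetectionCache:
    """LRU cache of detection results keyed by a hash of the image bytes."""
    def __init__(self, detect_fn, capacity=10000):
        self.detect_fn = detect_fn   # the expensive GPU model call
        self.capacity = capacity
        self.store = OrderedDict()   # key -> cached detections, in recency order

    def detect(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self.store:
            self.store.move_to_end(key)   # cache hit: refresh recency
            return self.store[key]
        result = self.detect_fn(image_bytes)  # cache miss: run the model
        self.store[key] = result
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict least-recently used entry
        return result
```

A repeated query image then costs only a hash lookup instead of a full model forward pass.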

5. Experiments

5.1. Evaluation metrics and datasets

Unless specified, in the following sections we use mean average precision as defined in (Everingham et al., 2010; Lin et al., 2014) as our primary metric, where the average precision (AP) is calculated as the area under the precision-recall curve, with detections counted as true positives if their intersection-over-union (IoU) with the ground truth exceeds 50%. We denote the metric as AP50. AP50 weighs each category equally; however, to account for the true distribution of categories seen in production traffic, we also use weighted mean average precision@IoU50, denoted as wAP50:

wAP50 = (Σ_{c=1}^{C} w_c · AP50_c) / (Σ_{c=1}^{C} w_c)

where w_c and AP50_c are the weight and the AP50 for class c in a validation set of C classes, respectively. In our setting, we typically set w_c to the number of instances of class c in the validation set.
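Given per-class AP50 values and instance counts as weights, wAP50 reduces to a weighted mean. A minimal sketch (function and variable names are illustrative):

```python
def weighted_ap50(ap50_per_class, instance_counts):
    """Instance-weighted mean AP@IoU50: each class's AP50 is weighted by its
    number of validation instances, reflecting the production traffic mix."""
    total = sum(instance_counts.values())
    return sum(instance_counts[c] * ap50_per_class[c] for c in ap50_per_class) / total

# A frequent class dominates the weighted metric.
ap = {"person": 0.8, "sofa": 0.4}
counts = {"person": 90, "sofa": 10}
print(weighted_ap50(ap, counts))  # → 0.76 (vs. an unweighted mean of 0.6)
```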

| Model | Training data | Average | COCO 2017 Val | OpenImagesV5 Val | Bing-Fashion Val | Bing-HF Val |
| RetinaNet | D-1.4M | 44.88 / 46.30 | 46.44 / 48.99 | 53.63 / 42.66 | 37.27 / 53.81 | 42.16 / 39.73 |
| FCOS(D-1.4M) | D-1.4M | 52.36 / 55.01 | 54.95 / 55.69 | 60.67 / 51.34 | 41.01 / 57.84 | 52.82 / 55.17 |
| FCOS(D-3.4M) | D-3.4M | 50.78 / 53.47 | 53.22 / 55.15 | 61.57 / 51.02 | 37.89 / 55.01 | 50.45 / 52.70 |

Table 2. Experiments of GenOD models on the 4 validation sets (COCO 2017, OpenImagesV5, and the Bing internal fashion and home furnishing detection sets; each cell reports AP50 / wAP50), comparing the RetinaNet and FCOS architectures and the two FCOS variants trained on D-1.4M and D-3.4M. Overall FCOS(D-1.4M) is selected as the model candidate with the best average AP50 and wAP50 metrics.

We evaluate candidates on the validation splits of two public datasets (OpenImagesV5(Kuznetsova et al., 2020), COCO 2017(Lin et al., 2014)) and two of Bing’s internal validation sets in fashion and home furnishing, denoted as Bing-Fashion Val and Bing-HF Val, respectively. We use the average of AP50 and wAP50 metrics over the 4 validation sets as the criteria to select the final model candidate. We then measure final performance on 3 internal test sets: Bing-Generic Test, Bing-Fashion Test and Bing-HF Test which are collected by a weighted sampling of Bing traffic. Note that Bing traffic is much more challenging than the validation data due to a higher proportion of noisy, cluttered scenes in real-world data. For the COCO dataset, we follow the evaluation protocol in  (Lin et al., 2014) and also report the AP@IOU[0.5:0.95], which is simply denoted as AP. For the OpenImages dataset, we follow the same federated evaluation protocol in the OpenImages challenges (Kuznetsova et al., 2020).

5.2. Base model training

We implemented both the RetinaNet and FCOS models based on maskrcnn-benchmark (Francisco and Ross, 2019). Both models are trained with a Feature Pyramid Network (FPN) (Lin et al., 2017a) and Focal Loss (Lin et al., 2017b), using ResNet-101 (He et al., 2016) as the backbone. For FCOS we employ Modulated Deformable Convolution (DCNv2) (Zhu et al., 2019) at stages 2-4 and train the model with the improvements proposed in (Tian et al., 2019) to further boost accuracy. Both variants are trained for 24 epochs on the data described in Section 4.1 using 8 V100 GPUs, with a batch size of 64 and a learning rate of 0.03. To best optimize for online production latency, we use an input image resolution of 400×667. We use multi-scale training with the shorter side ranging from 300 to 500 while keeping the aspect ratio, to adapt to different scales of inputs.
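The multi-scale resizing scheme above can be sketched as sampling a target shorter side in [300, 500] and scaling both dimensions to preserve the aspect ratio. A minimal sketch (function name and sampling policy are illustrative; the paper does not specify how the scale is sampled):

```python
import random

def sample_train_size(width, height, short_range=(300, 500), seed=None):
    """Multi-scale training resize: sample a target shorter side and scale
    both dimensions to keep the original aspect ratio."""
    rng = random.Random(seed)
    target_short = rng.randint(*short_range)
    scale = target_short / min(width, height)
    return round(width * scale), round(height * scale)

# A 1200x800 image is rescaled so its shorter side falls in [300, 500].
print(sample_train_size(1200, 800, seed=0))
```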

5.2.1. Candidate selection

Table 2 shows the results of the GenOD models on the four validation sets described in Section 5.1. During inference, we map the results from the GenOD taxonomy to the corresponding categories in the benchmark datasets for a fair comparison. From the table we can see that the models trained with the FCOS architecture consistently outperform the RetinaNet one. We denote the FCOS models trained with D-1.4M and D-3.4M as FCOS(D-1.4M) and FCOS(D-3.4M), respectively. Overall FCOS(D-1.4M) achieves the best performance in the aggregated metrics, so we select this model as our candidate for further evaluation.

5.2.2. Label propagation in the taxonomy hierarchy

We also experiment with leveraging the hierarchical information in the GenOD taxonomy to propagate the bounding boxes and scores of fine-grained categories to their ancestors in the taxonomy at inference time. For example, if a "blue jay" is detected, the detection would also be propagated to generate "bird" and "animal" labels. We select OpenImages as the benchmark because it has a meaningful generic hierarchy. We evaluate label propagation for the two FCOS model candidates, trained on D-1.4M and D-3.4M respectively, on the OpenImagesV5 validation set. Significant improvements are observed over the original predictions without propagation: as shown in Table 3, label propagation improves the wAP50 of FCOS(D-1.4M) from 51.34 to 61.65, and that of FCOS(D-3.4M) from 51.02 to 63.16. Moreover, the AP50 of the propagated FCOS(D-3.4M) model is competitive among the best single models in the OpenImages Detection Challenge 2019 (https://storage.googleapis.com/openimages/web/challenge2019.html) that are trained with similar backbones at larger resolutions (800×1333), showing the effectiveness of our model training and post-processing approach.
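The propagation step itself is a simple walk up the taxonomy: every detection also emits its ancestor labels with the same box and score. A minimal sketch (the toy parent map and function names are illustrative):

```python
PARENTS = {"blue jay": "bird", "bird": "animal"}  # toy slice of a taxonomy

def propagate(detections, parents=PARENTS):
    """Propagate each detection's box and score to all of its ancestor
    categories in the taxonomy. Detections are (label, score, box) tuples."""
    out = []
    for label, score, box in detections:
        out.append((label, score, box))
        node = label
        while node in parents:          # walk up to the root
            node = parents[node]
            out.append((node, score, box))
    return out

print(propagate([("blue jay", 0.9, (10, 10, 50, 50))]))
# → [('blue jay', 0.9, (10, 10, 50, 50)), ('bird', 0.9, (10, 10, 50, 50)),
#    ('animal', 0.9, (10, 10, 50, 50))]
```

A production variant would additionally deduplicate propagated labels per box, e.g. by keeping the maximum score per category.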

| Model | AP50 | wAP50 |
| FCOS(D-1.4M) | 60.67 | 51.34 |
| FCOS(D-3.4M) | 61.57 | 51.02 |
| FCOS(D-1.4M) + propagation | 63.27 (+2.60) | 61.65 (+10.31) |
| FCOS(D-3.4M) + propagation | 64.44 (+2.87) | 63.16 (+12.14) |

Table 3. Evaluating label propagation for the FCOS model candidates trained on GenOD training data on the OpenImagesV5 validation set.

5.2.3. Comparison with segment models

Table 4 shows the test set results of GenOD using the wAP50 metric. We compare GenOD to two production segment models trained separately for fashion and home furnishing detection on the Bing-Fashion Train and Bing-HF Train datasets, respectively. It should be noted that each of those sets is contained within the combined GenOD training data. We find that GenOD improves performance on the key product verticals over domain-specific models while significantly reducing capacity and maintenance costs.

Architecture | Train data | Overall | Fashion | Home furnishing
RetinaNet | Bing-Fashion Train | - | 21.43 | -
RetinaNet | Bing-HF Train | - | - | 27.93
RetinaNet | GenOD | 28.20 | 20.29 | 32.02
GenOD (FCOS) | GenOD | 29.63 | 25.85 | 35.65
Table 4. Evaluation of weighted AP50 on the Bing object detection measurement set.

5.2.4. Latency

In Table 5, we benchmark the latency of the RetinaNet and FCOS variants in the single-GPU, batch-1 setting on the COCO validation set, on V100 GPUs with CUDA 11.0, averaging over five runs. From the table we can see that FCOS is faster than RetinaNet. More interestingly, we observe that when scaling up from the 80-class COCO taxonomy to the 900-class GenOD taxonomy, RetinaNet becomes several times slower while the latency of FCOS remains stable, which further widens the latency gap between the two models. This shows that the last few class-wise convolution layers in an anchor-based model incur significant overhead as the number of categories grows, and demonstrates that the anchor-free approach is robust in latency to vocabulary scaling, making it better suited to large-vocabulary object detection.

Architecture | COCO model (80 classes) latency | GenOD model (900 classes) latency
RetinaNet | - ms | - ms
FCOS | - ms | - ms
Table 5. Single-GPU batch-1 latency of the RetinaNet and FCOS variants of GenOD models on a V100 GPU. With the number of categories scaling from the 80 classes of COCO to the generic object taxonomy, the latency advantage of the FCOS architecture grows.
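A batch-1 latency protocol like the one used for Table 5 can be sketched as follows. This is a simplified harness under stated assumptions: `model_fn` is a hypothetical stand-in for a detector forward pass, and GPU-specific details (e.g., synchronizing the device before reading timers) are omitted.

```python
import time

def benchmark_batch1(model_fn, inputs, n_runs=5, warmup=1):
    """Average per-image latency (ms) in the batch-1 setting, averaged over
    n_runs timed passes after `warmup` untimed warmup passes."""
    for _ in range(warmup):
        for x in inputs:
            model_fn(x)
    per_run = []
    for _ in range(n_runs):
        start = time.perf_counter()
        for x in inputs:
            model_fn(x)
        # Convert to milliseconds per image for this run.
        per_run.append((time.perf_counter() - start) / len(inputs) * 1000.0)
    return sum(per_run) / n_runs

# Toy stand-in for a model forward pass.
latency_ms = benchmark_batch1(lambda x: sum(range(1000)), inputs=[0, 1, 2])
print(f"{latency_ms:.3f} ms/image")
```

For a real GPU model, one would also synchronize the device (e.g., `torch.cuda.synchronize()` in PyTorch) before each timer read so that asynchronous kernel launches do not skew the measurement.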

5.2.5. Experiments on COCO benchmark

In Table 6 we compare the GenOD model to a model trained on COCO with the same architecture. We can see that GenOD consistently outperforms the COCO-trained model, especially on small and mid-size objects, even though it targets a much broader vocabulary and is not specifically trained for COCO objects.

Architecture | Train data | AP | AP50 | AP75 | APS | APM | APL
FCOS | COCO | 37.1 | 54.5 | 39.6 | 16.4 | 40.6 | 56.3
FCOS | GenOD | 38.3 | 54.9 | 41.1 | 18.0 | 43.5 | 56.0
Table 6. Comparison of the GenOD model with an FCOS model trained on COCO, on the COCO 2017 validation set.

5.3. Disjoint detector training

As described in Section 4.3, here we compare the conventional joint training approach with our disjoint approach in a prototypical experiment. The baseline is the GenOD model with a jointly trained head for all categories, trained on the full GenOD training set. Given the GenOD model, suppose our goal is to improve the sofa category in response to user feedback, without performance degradation of other categories, within a short development cycle. For the update, we consider an additional set of labeled sofa data. Given this additional data, we conduct three experiments and report the results in Table 7:

  (a) Joint detector retraining: We train the single-head joint model with all the available data (the full GenOD training set plus the additional sofa data) using the same training scheme as GenOD, as described in Section 5.2.

  (b) Joint detector finetuning: We randomly sample 50k images from the combined data and finetune the joint detector starting from the GenOD model. For this finetuning stage, we use a smaller learning rate of 0.0001 and train on the data for 12 epochs.

  (c) Disjoint detector finetuning: We split the GenOD head to create a disjoint detector head for just the sofa category. We finetune this model on the same dataset (the 50k randomly sampled images) as in (b). We use a learning rate of 0.00003 and train the disjoint detector head of the model for 12 epochs.

As seen from the experimental results in Table 7, disjoint detector finetuning on just a small amount of data is far more agile, allowing us to train 300x faster than joint retraining in (a) and 2x faster than joint finetuning in (b), while also improving the category AP. This is achieved without disrupting the existing model for any of the other categories, which allows for stable updates in the production stack compared to the conventional model retraining process.

Method | Head | Initialization | Train data | #Images | Train time | Sofa AP
Baseline: GenOD | Joint | ImageNet | full GenOD training set | 1.4M | 8 days | 0.6453
(a) Joint detector retraining | Joint | ImageNet | full set + additional sofa data | 3.4M | 19 days | 0.6254
(b) Joint detector finetuning | Joint | GenOD | 50k random sample | 50K | 188 mins | 0.6443
(c) Disjoint detector finetuning | Disjoint | GenOD | 50k random sample | 50K | 100 mins | 0.6499
Table 7. Evaluation of model update agility for the sofa category. Disjoint training of the targeted category is much faster, while also increasing its AP and keeping other categories stable.
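The core of the disjoint update in (c) is a partition of the model's parameters: everything shared (backbone, FPN, shared head layers) stays frozen, and only the cloned per-category head is trained. A minimal sketch of that partition logic, with illustrative parameter names rather than the production ones:

```python
# Sketch of the disjoint-update idea: freeze all shared parameters and
# train only the cloned head for the category being updated.
# Parameter names below are hypothetical, not the production naming scheme.

def split_trainable_params(all_param_names, target_head="head.sofa"):
    """Partition named parameters into frozen (shared trunk + other heads)
    and trainable (the disjoint head for the category being updated)."""
    frozen, trainable = [], []
    for name in all_param_names:
        (trainable if name.startswith(target_head) else frozen).append(name)
    return frozen, trainable

params = ["backbone.conv1", "fpn.lateral3", "head.shared",
          "head.sofa.cls", "head.sofa.reg"]
frozen, trainable = split_trainable_params(params)
print(trainable)  # ['head.sofa.cls', 'head.sofa.reg']
```

In a framework like PyTorch, the same partition would be realized by setting `requires_grad = False` on the frozen group and passing only the trainable group to the optimizer, which is what keeps every other category's predictions byte-for-byte stable during the update.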

6. Applications

6.1. Object Detection for Visual Product Search

One of the primary applications of GenOD is to help users better formulate visual search queries and improve the relevance of search results. Figure 2 showcases the hotspot interactions in the Bing Image Details Page. GenOD assists the user in formulating a query: instead of the user having to manually crop the region of interest in the image, detected hotspots are shown over the image so users can start their search with just a tap or click. GenOD's detected categories can also be passed to downstream tasks such as the similar-image search ranker to filter out semantically irrelevant results and improve relevance. To this end, we trained a lightweight category classifier and applied it to the index images; at query time, the detected category of the query is matched against the categories of the index images, and images that do not match are filtered out.
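The category-filtering step can be sketched as a simple predicate over retrieved candidates. This is a minimal illustration; the record shapes, category names, and URLs below are hypothetical.

```python
# Sketch of object-category filtering for retrieved index images.

def filter_by_category(query_category, candidates):
    """Keep only index images whose classified category matches the
    category GenOD detected on the query object."""
    return [c for c in candidates if c["category"] == query_category]

candidates = [
    {"url": "img1", "category": "handbag"},
    {"url": "img2", "category": "shoe"},
    {"url": "img3", "category": "handbag"},
]
kept = filter_by_category("handbag", candidates)
print([c["url"] for c in kept])  # ['img1', 'img3']
```

The production system applies this after the ranker's candidate retrieval, so the filter only needs to inspect the precomputed category of each index image rather than re-classify anything at query time.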

We conducted comprehensive offline and online experiments on the efficacy of GenOD in improving the visual search experience. We measure the defect rate of the top-5 visual search results from hotspot clicks on the fashion segment, where defect rate is defined as the average error rate in the categories of the retrieved images. As Table 8 shows, after applying the GenOD categories for filtering, the hotspot click-through defect rate dropped by 54.9%, substantially improving relevance and the user experience.

Filtering approach | Defect rate@5
Without object category | 38.27%
Object category filtering | 17.26% (-54.9%)
Table 8. Defect rates of top-5 visual search results from object detection hotspot clicks (lower is better). With the ranking results filtered semantically by object detection categories, the product search defect rate decreases significantly by 54.9%.
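The defect-rate@5 metric in Table 8 can be computed as a per-query average of category mismatches in the top-5 results. A small sketch, assuming hypothetical query records where each result is represented by its classified category:

```python
# Sketch of the defect-rate@k metric: for each hotspot query, count top-k
# results whose category disagrees with the query object's category, then
# average the per-query defect fractions.

def defect_rate_at_k(queries, k=5):
    rates = []
    for q in queries:
        top_k = q["results"][:k]
        defects = sum(1 for cat in top_k if cat != q["category"])
        rates.append(defects / len(top_k))
    return sum(rates) / len(rates)

queries = [
    {"category": "dress", "results": ["dress", "dress", "skirt", "dress", "coat"]},
    {"category": "sofa",  "results": ["sofa", "chair", "sofa", "sofa", "sofa"]},
]
print(defect_rate_at_k(queries))  # averages 2/5 and 1/5 -> 0.3
```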
Engagement metric | Gain
% of users entering Visual Search | +14.03%
Overall entries to Visual Search | +27.89%
Hotspot clicks per user | +59.90%
Table 9. Aggregated online user engagement metrics after deploying GenOD to Bing Visual Search, which show significant gains over the baseline.

We also set up a series of online A/B tests to measure user engagement before and after deploying GenOD to all Bing Visual Search requests, as shown in Table 9. Aggregated online user engagement metrics, including the percentage of users entering visual search, overall entries to visual search, and hotspot clicks per user, show that GenOD brings significant engagement gains, demonstrating the advantage of expanding object detection to generic objects on all images.

6.2. Object-Level Triggering for Fine-grained Visual Recognition

Bing Visual Search runs multiple fine-grained recognition models covering animals, plants, landmarks and more. Since these models usually have high latency, it is necessary to perform lightweight triggering before running them. Previous image-level visual triggering models, such as the one described in (Hu et al., 2018), often fail to trigger on small objects or when multiple different objects are in the scene. For a fair comparison with the previous approach, we compare triggering performance at the image level by aggregating the outputs of the GenOD model. In Table 10 we evaluate the image-level triggering precision and recall on the food recognition measurement set. For triggering, we prefer models with higher recall. Compared to the image-level triggering model, GenOD improves triggering recall by 8.36%, largely by detecting and recognizing smaller objects.

Triggering approach | Precision | Recall
Image-level | 99.8 | 81.3
Object-level (GenOD) | 98.5 | 88.1 (+8.36%)
Table 10. Comparison of triggering precision and recall between image-level and object-level triggering on the food recognition measurement set.
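Aggregating object-level detections into an image-level trigger decision reduces to checking whether any detected object falls in a triggering category with sufficient confidence. A minimal sketch under stated assumptions: the category set and threshold below are illustrative, not the production configuration.

```python
# Sketch of object-level triggering for a downstream fine-grained recognizer.
FOOD_CATEGORIES = {"pizza", "sandwich", "cake"}  # hypothetical subset

def should_trigger(detections, categories=FOOD_CATEGORIES, threshold=0.5):
    """Trigger the fine-grained model if any detected object belongs to a
    triggering category with a score at or above the threshold."""
    return any(d["label"] in categories and d["score"] >= threshold
               for d in detections)

dets = [{"label": "table", "score": 0.9}, {"label": "pizza", "score": 0.6}]
print(should_trigger(dets))  # True
```

Because the decision considers every detected object rather than a single image-level score, a small pizza slice in a cluttered scene can still fire the food recognizer, which is the behavior behind the recall gain in Table 10.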

6.3. Bing Mobile Camera

Bing mobile camera is a real-time mobile web experience that allows users to search without having to manually capture a picture. The mobile interface is shown in Figure 2 (right). When a user opens the camera experience and the phone is stable, a frame is captured and sent to the GenOD service for real-time object detection. The detected objects are sent back to the phone and rendered as hotspots. An in-app object tracking model keeps track of the objects while they are in view, so hotspots stay on the objects without the need for additional GenOD requests. Clicking on a detected hotspot returns results relevant to the selected object. GenOD enhances the user experience and simplifies the formulation of visual search queries: online A/B tests show a 31.8% reduction in responses with no visual search results compared to our control, in which we depend on users to formulate a visual query.

7. Conclusions

We presented GenOD, a web-scale generic object detection service that is a fundamental image understanding component of Bing Visual Search and fuels multiple downstream applications. We described an efficient data collection workflow for training the models used in the system. We demonstrated through experiments that the GenOD model achieves competitive object detection results across production measurement sets and academic benchmarks, with a good speed-accuracy tradeoff, and can be updated with agility while remaining stable for its dependencies. Specifically, we showed that by moving to a single unified large-scale generic detector, GenOD achieves better results than multiple domain-specific models in each vertical, reducing the cost of maintaining several segment models. Finally, we showed how GenOD benefits visual search applications by significantly improving user engagement and search relevance.

Acknowledgments

We would like to thank Arun Sacheti, Meenaz Merchant, Surendra Ulabala, Mikhail Panfilov, Andre Alves, Kiril Moskaev, Vishal Thakkar, Avinash Vemuluru, Souvick Sarkar, Li Zhang, Anil Akurathi, Vladimir Vakhrin, Houdong Hu, Rui Xia, Xiaotian Han, Dongfei Yu, Ye Wu, Vincent Chen, Kelly Huang, Nik Srivastava, Yokesh Kumar, Mark Bolin, Mahdi Hajiaghayi, Pengchuan Zhang, Xiyang Dai, Lu Yuan, Lei Zhang and Jianfeng Gao for their support, collaboration and many interesting discussions.


References

  • S. Bell, Y. Liu, S. Alsheikh, Y. Tang, E. Pizzi, M. Henning, K. Singh, O. Parkhi, and F. Borisyuk (2020) GrokNet: unified computer vision model trunk and embeddings for commerce. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2608–2616.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
  • F. Massa and R. Girshick (2019) maskrcnn-benchmark: fast, modular reference implementation of instance segmentation and object detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark.
  • Y. Gao, X. Bu, Y. Hu, H. Shen, T. Bai, X. Li, and S. Wen (2018) Solution for large-scale hierarchical object detection datasets with incomplete annotation and data imbalance. arXiv preprint arXiv:1810.06208.
  • Google (2017) Online resource.
  • A. Gupta, P. Dollár, and R. Girshick (2019) LVIS: a dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5356–5364.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • H. Hu, Y. Wang, L. Yang, P. Komlev, L. Huang, X. S. Chen, J. Huang, Y. Wu, M. Merchant, and A. Sacheti (2018) Web-scale responsive visual search at Bing. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 359–367.
  • J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7310–7311.
  • Y. Jing, D. Liu, D. Kislyuk, A. Zhai, J. Xu, J. Donahue, and S. Tavel (2015) Visual search at Pinterest. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1889–1898.
  • G. Kakivaya, L. Xun, R. Hasha, S. B. Ahsan, T. Pfleiger, R. Sinha, A. Gupta, M. Tarta, M. Fussell, V. Modi, et al. (2018) Service Fabric: a distributed platform for building microservices in the cloud. In Proceedings of the Thirteenth EuroSys Conference, pp. 1–15.
  • B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis (2019) Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual Genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73.
  • A. Krishnan (2019) Online resource.
  • A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al. (2020) The Open Images dataset V4. International Journal of Computer Vision, pp. 1–26.
  • W. S. Lasecki, A. Marcus, J. M. Rzeszotarski, and J. P. Bigham (2014) Using microtask continuity to improve crowdsourcing. Technical report.
  • H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750.
  • Y. Li, T. Wang, B. Kang, S. Tang, C. Wang, J. Li, and J. Feng (2020) Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10991–11000.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017a) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017b) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
  • L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen (2020) Deep learning for generic object detection: a survey. International Journal of Computer Vision 128 (2), pp. 261–318.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
  • D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. Van Der Maaten (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196.
  • Microsoft (website) Online resource.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You Only Look Once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
  • J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497.
  • S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun (2019) Objects365: a large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8430–8439.
  • R. Shiau, H. Wu, E. Kim, Y. L. Du, A. Guo, Z. Zhang, E. Li, K. Gu, C. Rosenberg, and A. Zhai (2020) Shop The Look: building a large scale visual shopping system at Pinterest. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3203–3212.
  • H. Su, J. Deng, and L. Fei-Fei (2012) Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence.
  • M. Tan, R. Pang, and Q. V. Le (2020) EfficientDet: scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790.
  • Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9627–9636.
  • F. Yang, A. Kale, Y. Bubnov, L. Stein, Q. Wang, H. Kiapour, and R. Piramuthu (2017) Visual search at eBay. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2101–2110.
  • A. Zhai, D. Kislyuk, Y. Jing, M. Feng, E. Tzeng, J. Donahue, Y. L. Du, and T. Darrell (2017) Visual discovery at Pinterest. In Proceedings of the 26th International Conference on World Wide Web Companion, pp. 515–524.
  • Y. Zhang, P. Pan, Y. Zheng, K. Zhao, Y. Zhang, X. Ren, and R. Jin (2018) Visual search at Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 993–1001.
  • X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable ConvNets v2: more deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9308–9316.