FoodLogoDet-1500: A Dataset for Large-Scale Food Logo Detection via Multi-Scale Feature Decoupling Network

08/10/2021 ∙ by Qiang Hou, et al. ∙ Institute of Computing Technology, Chinese Academy of Sciences

Food logo detection plays an important role in multimedia for its wide real-world applications, such as food recommendation in self-service shops and infringement detection on e-commerce platforms. A large-scale food logo dataset is urgently needed for developing advanced food logo detection algorithms, yet no available food logo datasets include food brand information. To support efforts towards food logo detection, we introduce FoodLogoDet-1500, a new large-scale publicly available food logo dataset with 1,500 categories, about 100,000 images and about 150,000 manually annotated food logo objects. We describe the collection and annotation process of FoodLogoDet-1500, analyze its scale and diversity, and compare it with other logo datasets. To the best of our knowledge, FoodLogoDet-1500 is the first and largest publicly available high-quality dataset for food logo detection. The challenge of food logo detection lies in the large number of categories and the similarities between food logo categories. To address this, we propose a novel food logo detection method, the Multi-scale Feature Decoupling Network (MFDNet), which decouples classification and regression into two branches and focuses on the classification branch to solve the problem of distinguishing multiple food logo categories. Specifically, we introduce a feature offset module, which utilizes deformation learning to find the optimal classification offset and can effectively obtain the most representative features for classification in detection. In addition, we adopt a balanced feature pyramid in MFDNet, which attends to global information, balances the multi-scale feature maps, and enhances feature extraction capability. Comprehensive experiments on FoodLogoDet-1500 and two other benchmark logo datasets demonstrate the effectiveness of the proposed method. FoodLogoDet-1500 can be found at this https URL.

1. Introduction

Figure 1. Some samples from FoodLogoDet-1500. Green boxes: ground-truth boxes.

Logo detection has been extensively studied in the multimedia field (Gao et al., 2014; Kalantidis et al., 2011; Revaud et al., 2012; Romberg and Lienhart, 2013; Eggert et al., 2015). As one significant task in logo detection, food logo detection can be applied to healthy diet recommendation, food trademark infringement disputes, food advertising placement and supermarket self-checkout systems. For example, with the rapid development of e-commerce platforms, many food businesses neglect copyright in pursuit of profits, resulting in irreparable losses. Through food logo detection, we can avoid trademark infringement by detecting a new food logo and comparing its similarity with existing food logos. Furthermore, once we detect the logo on food products, we can further support various health-related and advertising applications (Min et al., 2019).

Despite its great application potential, food logo detection is still a challenging task, and the challenge mainly derives from two aspects:

  • There is a lack of a large-scale food logo dataset for food logo detection. Existing works mainly rely on general-purpose logo datasets, such as FlickrLogos-32 (Romberg et al., 2011) and QMUL-OpenLogo (Su et al., 2018). For example, Romberg et al. (Romberg et al., 2011) introduce the FlickrLogos-32 dataset with 32 categories, of which only very few are food logos. Su et al. (Su et al., 2018) release a fully annotated logo dataset with 352 categories, but it is not exclusively about food logos. Existing logo datasets only contain a tiny number of food logo categories, and are therefore probably not sufficient for constructing more complicated deep learning models for food logo detection.

  • Food logo images contain multi-scale and visually similar logos, which are harder to detect in many cases. Compared with other logo images, the multi-scale and similar food logos in food logo images are more complicated and make it difficult to accurately extract effective features. As shown in Fig. 1, the first row presents eight different food logo images. Two classes of food logos can be so similar that they are difficult to distinguish, such as ‘Chips Ahoy’ and ‘Chips More’; different brands of the same food may have similar food logos, which makes detection more difficult. Some food logos look like text, and they also suffer from occlusion. Food logos are also multi-scale in nature, as the second row illustrates with ‘MALLOW OATS’ and ‘Calbee’. These characteristics make food logo detection difficult.

In this work, we address the data limitation by building a large-scale dataset, FoodLogoDet-1500, with 1,500 categories, 99,768 images and 145,400 objects. As the largest food logo detection dataset so far, FoodLogoDet-1500 brings great opportunities and challenges for food logo detection in both general and sophisticated scenarios. To address the second challenge, we propose a Multi-scale Feature Decoupling Network (MFDNet) to improve food logo detection. This is achieved by two main modules, the Feature Offset Module (FOM) and the Balanced Feature Pyramid (BFP). FOM first decouples classification and regression into two branches, then utilizes deformation learning to find the optimal classification offset; finally, the optimal classification offset is merged with the original features of the network. Experiments show that FOM improves classification accuracy in food logo detection. In addition, we adopt BFP in MFDNet, which attends to global information and performs well on multi-scale food logos.

To summarize, the main contributions of our paper are as follows:

  • We first introduce a large-scale and highly diverse food logo dataset FoodLogoDet-1500 with 1,500 categories, 99,768 images and 145,400 objects.

  • We propose a Multi-scale Feature Decoupling Network for food logo detection, which decouples the shared head into separate classification and regression branches. In this network, we further introduce a balanced feature pyramid to support the detection of multi-scale food logos.

  • We conduct extensive evaluation on three datasets, FoodLogoDet-1500 and two other standard logo datasets, QMUL-OpenLogo and FlickrLogos-32, and verify the effectiveness of our proposed method.

2. Related Work

Figure 2. Sorted distribution of the number of images from each food logo in the FoodLogoDet-1500.
Dataset Logos Images Objects Availability
BelgaLogos (Joly and Buisson, 2009) 37 10,000 - Yes
FlickrLogos-27 (Kalantidis et al., 2011) 27 1,080 4,671 Yes
FlickrLogos-32 (Romberg et al., 2011) 32 8,240 5,644 Yes
Top-Logo-10 (Su et al., 2017b) 10 700 - Yes
WebLogo-2M (Su et al., 2017a) 194 1,867,177 - Yes
QMUL-OpenLogo (Su et al., 2018) 352 27,083 - Yes
Logos-in-the-Wild (Tüzkö et al., 2017) 871 11,054 32,850 Yes
Logo-2K+ (Wang et al., 2020c) 2,341 167,140 - Yes
LogoDet-3K (Wang et al., 2020b) 3,000 158,652 194,261 Yes
MICC-Logos (Sahbi et al., 2012) 13 720 - No
FlickrBelgaLogos (Letessier et al., 2012) 34 10,000 2,695 No
Logo-18 (Hoi et al., 2015) 18 8,460 16,043 No
Logo-160 (Hoi et al., 2015) 160 73,414 130,608 No
Logos-32plus (Bianco et al., 2017) 32 7,830 12,302 No
Video SportsLogo (Liao et al., 2017) 20 2,000 - No
CarLogo-51 (Xie et al., 2014) 51 11,903 - No
Open Brands (Jin et al., 2020) 1,216 1,437,812 3,113,828 No
SynthLogo (Mas et al., 2018) 604 280,000 - No
PL2K (Fehérvári and Appalaraju, 2019) 2,000 295,814 - No
FoodLogoDet-1500 1,500 99,768 145,400 Yes
Table 1. Comparison between FoodLogoDet-1500 and existing logo datasets.

This section presents related work in the areas of logo datasets and logo detection.

2.1. Logo Datasets

Large-scale datasets play an indispensable role in current object detection algorithms, and food logo detection is no exception. In object detection, MS COCO (Lin et al., 2014) and PASCAL VOC (Everingham et al., 2010) are the most commonly used datasets. In logo detection, FlickrLogos-32 (Romberg et al., 2011) is the most popular dataset; however, it only consists of 32 logo categories with 70 images in each category. Similarly, Top-Logo-10 (Su et al., 2017b) contains even fewer logo categories and images. Logo-2K+ (Wang et al., 2020c) is an image-level dataset and cannot be used for logo detection. QMUL-OpenLogo (Su et al., 2018) consists of 352 logo categories, but it is an all-encompassing logo dataset (e.g., foods, clothes, daily necessities) and cannot serve food logo detection directly. Wang et al. introduce LogoDet-3K (Wang et al., 2020b), which has 3,000 categories, of which only 932 classes are food logos. To promote the development of food logo detection in the multimedia community, we make supplementary improvements on the food logo portion of LogoDet-3K and build FoodLogoDet-1500 on this basis. In addition, some researchers construct other logo datasets, such as Logo-160 (Hoi et al., 2015) and Open Brands (Jin et al., 2020). These logo datasets are not currently available to the public, which limits their usefulness for logo detection research.

Food logo detection is an important branch of logo detection. However, there are no publicly available food logo datasets with brand information at present. Therefore, we introduce a new large-scale food logo dataset, FoodLogoDet-1500, with 1,500 food logo categories. Table 1 summarizes the statistics of existing logo datasets and FoodLogoDet-1500. To the best of our knowledge, FoodLogoDet-1500 is the first and largest publicly available high-quality dataset for food logo detection, and it helps to promote the development of food logo detection research.

2.2. Logo Detection

Typically, logo detection is performed by adapting object detection methods to the domain of commercial logos (Iandola et al., 2015), i.e., treating each logo as a different object class. Traditionally, hand-crafted features, such as SIFT (Lowe, 1999) and textures (Haralick et al., 1973), combined with statistical classifiers, such as Support Vector Machines (SVM) (Cortes and Vapnik, 1995), were the main approaches for object detection. In the last few years, deep learning has shown strong performance in object detection, with Convolutional Neural Networks (CNNs) (LeCun et al., 2015; Schmidhuber, 2015) as the core element of deep learning methods. Multiple CNN-based object detection methods have been presented. In general, object detectors can be divided into two types: two-stage detectors and one-stage detectors. Two-stage means that the detection algorithm is completed in two steps: candidate regions are first obtained and then classified, as in the R-CNN series, e.g., Fast R-CNN (Girshick, 2015) and Faster R-CNN (Ren et al., 2015). One-stage detectors, by contrast, perform detection in a single step without searching for candidate regions separately, typically including SSD (Liu et al., 2016) and YOLO (Redmon et al., 2016; Redmon and Farhadi, 2017, 2018; Bochkovskiy et al., 2020). Recently, anchor-free methods (Law and Deng, 2018) and transformers (Carion et al., 2020) have also been widely used for object detection.

Multi-scale feature fusion is one of the most important research hotspots in deep networks. Low-level features generally lack semantic information but are rich in geometric detail; the opposite holds for high-level features. FPN (Lin et al., 2017a) first built a top-down architecture with lateral connections to extract features across multiple layers. PANet (Liu et al., 2018) directly created a short path for low-level feature maps, since detecting large objects also needs the assistance of location-sensitive feature maps. Libra R-CNN (Pang et al., 2019) improved feature fusion by adding a non-local block to refine the combined feature maps.

Figure 3. The detailed statistics of FoodLogoDet-1500.

Detection heads are another focus of research. Mask R-CNN (He et al., 2017) brought in an extra head for instance segmentation. IoU-Net (Jiang et al., 2018) introduced a branch to predict IoUs between detected bounding boxes and their corresponding ground-truth boxes. FCOS (Tian et al., 2019) added a single-layer branch, parallel to the classification branch, to predict the centerness of positions. Double-Head R-CNN (Wu et al., 2020) proposed to disentangle the sibling head into two independent branches for classification and localization. Song et al. (Song et al., 2020) also decoupled the classification and regression branches and obtained relatively good detection results.

Different from these works, our work decouples classification and regression into two branches and focuses on the classification branch to solve the problem of distinguishing multiple food logo categories. At the same time, we consider the multi-scale and similar characteristics of food logos and use multi-layer features for food logo detection.

3. FoodLogoDet-1500

In order to obtain a high-quality food logo dataset with high diversity and high coverage, we build FoodLogoDet-1500 in the following three steps. (1) Constructing the Food Logo Category List. To guarantee wide coverage of the food logo category list, we resort to the widely used shopping applications Taobao and Jingdong, together with Wikipedia. (2) Collecting Food Logo Images. Using query terms from the constructed food logo category list, we crawled candidate images from several search engines (i.e., Google, Bing and Baidu) for broader coverage and higher diversity of food logo images compared with datasets drawn from a single source. We also appended scene words to the queries, such as ‘Coca Cola + Supermarket’ and ‘Heineken + Bar’, to make the captured food logo images more complex and diverse. (3) Cleaning and Labeling Food Logo Images. We checked each category manually to ensure that every image contained the corresponding food logo. It is worth noting that we focused on the food logo rather than the food itself. We also deleted repetitive images and images with incomplete RGB channels. Labeling is not only the most important step in creating a dataset, but also the most complicated one: every food logo object needs to be annotated, regardless of which image it appears in. We kept low-resolution and incomplete food logo images to increase the challenge of the dataset. After labeling, we conducted manual verification by crowd-sourcing the task to 13 lab members. In addition, a food brand may have two or more different types of logos, such as graphic logos and textual logos. We treat different logo variations of the same brand as distinct food logo classes, similar to (Tüzkö et al., 2017); a suffix ‘-1’, ‘-2’, etc. is appended to the logo name to form the new category, e.g., ‘Maruchan-1’ denotes the ‘Maruchan’ graphic logo while ‘Maruchan-2’ denotes its textual logo.
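To make the cleaning step concrete, the following is a minimal sketch of how duplicate removal and the RGB-channel check could be implemented; the directory layout, file naming, and hashing strategy are hypothetical illustrations, not the actual pipeline used to build the dataset.

```python
# Hypothetical sketch of the cleaning step: remove byte-identical duplicates
# and images without complete RGB channels. Paths and layout are illustrative.
import hashlib
from pathlib import Path

from PIL import Image  # Pillow

def clean_category(category_dir: str) -> None:
    seen = set()
    for path in sorted(Path(category_dir).glob("*.jpg")):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if digest in seen:  # repetitive image
            path.unlink()
            continue
        seen.add(digest)
        with Image.open(path) as img:
            keep = img.mode == "RGB"  # incomplete RGB channels are dropped
        if not keep:
            path.unlink()
```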

Figure 4. Overview of the proposed Multi-scale Feature Decoupling Network (MFDNet) for food logo detection. BFP: Balanced Feature Pyramid. FOM: Feature Offset Module. RPN: Region Proposal Network. FC: Fully Connected layer.

After completing the construction of FoodLogoDet-1500, we provide statistics at the category level to show the details of our dataset. Fig. 2 shows the sorted distribution of the number of images from sampled classes; the imbalanced distribution across different food logo categories is one characteristic of FoodLogoDet-1500, posing a challenge for effective food logo detection with few samples. In addition, we report statistics on images and objects in FoodLogoDet-1500, as shown in Fig. 3. Fig. 3 (A) shows the distribution of the number of images for each category, where each category represents one food logo. Fig. 3 (B) shows the distribution of the number of objects for each category. As we can see, there is an imbalance between images and objects across food logo categories. Fig. 3 (C) provides the number of objects per image; most images contain one or two logo objects, which mirrors what happens in the real world. Fig. 3 (D) gives the distribution of object sizes per image. In FoodLogoDet-1500, the large percentage of small and medium food logo objects (about 56%) poses another challenge for food logo detection, since smaller food logos are harder to detect.

4. Methodology

In this section, we introduce the proposed Multi-scale Feature Decoupling Network (MFDNet) for food logo detection. Fig. 4 illustrates the architecture of MFDNet, which contains two main components, namely the Balanced Feature Pyramid (BFP) and the Feature Offset Module (FOM). Specifically, the features of an input food logo image are extracted by ResNet-50 (He et al., 2016). Then FPN is employed to fuse multi-scale features, and BFP is used for feature refinement of the feature maps; feature fusion and feature refinement together make multi-scale food logo detection more effective. The region proposal generation step yields a set of regions of interest (RoIs) using the Region Proposal Network (RPN). The RoIs are then fed into the RoI pooling layer, in which each RoI is pooled into a fixed-size feature map. Finally, the head is divided into classification and regression branches by feature decoupling. FOM is used to disentangle classification from regression: in the classification branch, FOM utilizes deformation learning to find the optimal offset, which helps us obtain the most representative features for classification in food logo detection. The optimal classification offset is then merged with the original features of the network. Finally, the feature maps are mapped to a feature vector by a fully connected layer (FC), followed by training the final object classifiers and bounding box regressors.
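The data flow just described can be summarized in a few lines. The sketch below is a schematic rendering of Fig. 4, not the actual implementation; every argument is a placeholder callable standing in for the corresponding MFDNet component.

```python
# Schematic of the MFDNet forward pass (cf. Fig. 4); all modules are
# placeholder callables for the components described in the text.
def mfdnet_forward(image, backbone, fpn, bfp, rpn, roi_pool, fom, cls_head, reg_head):
    feats = fpn(backbone(image))       # ResNet-50 features fused by FPN
    feats = bfp(feats)                 # balanced feature pyramid refinement
    rois = rpn(feats)                  # region proposals
    roi_feats = roi_pool(feats, rois)  # fixed-size per-RoI feature maps
    cls_feats = fom(roi_feats, rois)   # offset-refined classification features
    # Decoupled heads: classification on FOM features, regression on originals.
    return cls_head(cls_feats), reg_head(roi_feats)
```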

Next, we will focus on two main modules in the MFDNet, namely BFP and FOM.

4.1. BFP

In object detection, multi-scale feature fusion has been a hot research topic. Deep high-level features in backbones carry more semantic information, while shallow low-level features describe content in more detail (Zeiler and Fergus, 2014). On that basis, we integrate BFP into MFDNet to better fuse multi-scale features. Different from former methods that integrate multi-level features using lateral connections, BFP uses the same deeply aggregated feature maps to integrate balanced semantic features and strengthen the multi-level features.

To integrate multi-level features while maintaining their semantic hierarchy, we first resize the multi-level FPN outputs to the size of an intermediate level, using interpolation and max-pooling on the other levels to prepare for integration. The integrated semantic information is then obtained by Eq. (1):

(1)   $C = \frac{1}{L} \sum_{l=l_{min}}^{l_{max}} C_l$

where $C_l$ is the $l$-th feature map, the number of multi-level features is denoted as $L$, and $l_{min}$ and $l_{max}$ are the lowest and highest feature levels, respectively.

Then, BFP uses a non-local module to further refine the balanced semantic features. The refining step enhances the integrated features and makes them more discriminative. The non-local module is adopted as follows:

(2)   $y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$

where $i$ is the index of an output position whose response is to be computed and $j$ is the index that enumerates all possible positions in the feature map $x$. $y$ is the output of the same size as $x$. $f(x_i, x_j)$ computes a scalar between $x_i$ and all $x_j$. $g(x_j)$ computes a representation of the input at position $j$. $\mathcal{C}(x)$ is the normalization parameter.

After BFP, we can use the feature information of different layers more effectively.
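A compact PyTorch sketch of this procedure is shown below, assuming a list of FPN feature maps ordered fine to coarse; the choice of the middle level as the common resolution and the nearest-neighbor resampling mode are illustrative assumptions.

```python
import torch.nn.functional as F

def balanced_feature_pyramid(feats, refine=None):
    """Sketch of BFP: resize all FPN levels to one resolution, average them
    (Eq. 1), optionally refine with a non-local block, and scatter back.
    `feats` is a list of (N, C, H_l, W_l) tensors, fine to coarse."""
    target = feats[len(feats) // 2].shape[-2:]  # an intermediate level's size
    gathered = [
        F.adaptive_max_pool2d(f, target) if f.shape[-2] > target[0]
        else F.interpolate(f, size=target, mode="nearest")
        for f in feats
    ]
    balanced = sum(gathered) / len(gathered)  # Eq. (1)
    if refine is not None:                    # e.g., the non-local module of Eq. (2)
        balanced = refine(balanced)
    # Strengthen each original level with the balanced semantic features.
    return [
        f + (F.adaptive_max_pool2d(balanced, f.shape[-2:])
             if f.shape[-2] < target[0]
             else F.interpolate(balanced, size=f.shape[-2:], mode="nearest"))
        for f in feats
    ]
```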

4.2. FOM

The challenge of food logo detection lies in the large number of categories and similar food logos. Thus, we focus on the classification problem in food logo detection with large-scale categories. For large-scale food logo classification, we want to extract the most expressive semantic regional features in images. As shown in Fig. 4, different from the original detection head, FOM introduces an auto-learned anchor region proposal for pixel-wise offsets, and is used to search for the best features for food logo classification.

We use deformable learning to achieve this goal. As shown in Fig. 4, $y$ is the output feature map of the RoI pooling layer: RoI pooling divides the RoI into $k \times k$ bins and outputs the feature map $y$. From $y$, a fully connected layer generates the normalized offsets $\Delta\hat{p}_{ij}$, which are then transformed into the offsets $\Delta p_{ij}$ by element-wise product with the RoI's width and height, as in Eq. (3). For the $(i, j)$-th bin, this translation is applied to the sample points in it to obtain the new sample points for pooling:

(3)   $\Delta p_{ij} = \gamma \cdot \Delta\hat{p}_{ij} \circ (w, h)$

where $\gamma$ is a predefined scalar that modulates the magnitude of $\Delta p_{ij}$, and $(w, h)$ are the width and height of the RoI.

To generate feature maps from the irregular sample points, we use deformable RoI pooling:

(4)   $y(i, j) = \sum_{p \in bin(i, j)} \frac{x(p_0 + p + \Delta p_{ij})}{n_{ij}}$

where $p_0$ is the top-left corner of the RoI, $p$ enumerates all integral spatial locations in the feature map $x$, and $n_{ij}$ is the number of pixels in the bin. As the offset $\Delta p_{ij}$ is typically fractional, $x(\cdot)$ is implemented via bilinear interpolation.

By disentangling the shared proposal for classification and regression, FOM searches for the best features for food logo classification. It allows the classification task to adaptively seek the optimal location in space, which yields excellent detection accuracy for large-scale categories and similar food logos.
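To illustrate the offset branch, here is a small PyTorch sketch of Eq. (3); the grid size k, the scalar gamma, and the module layout are illustrative assumptions, and the resulting offsets would be consumed by a deformable RoI pooling operator implementing Eq. (4).

```python
import torch.nn as nn

class OffsetBranch(nn.Module):
    """Sketch of the FOM offset learning of Eq. (3): a fully connected layer
    predicts normalized per-bin offsets, scaled by gamma and the RoI's
    width/height. k and gamma are illustrative choices."""
    def __init__(self, channels: int, k: int = 7, gamma: float = 0.1):
        super().__init__()
        self.k, self.gamma = k, gamma
        self.fc = nn.Linear(channels * k * k, 2 * k * k)  # (dx, dy) per bin

    def forward(self, pooled, rois_wh):
        # pooled: (R, C, k, k) RoI-pooled features; rois_wh: (R, 2) = (w, h).
        norm = self.fc(pooled.flatten(1)).view(-1, self.k * self.k, 2)
        # Eq. (3): element-wise product with the RoI's width and height.
        return self.gamma * norm * rois_wh[:, None, :]  # input to Eq. (4)
```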

4.3. Loss Function

In MFDNet, the final loss function is as follows:

(5)   $L = L_{rpn} + L_{cls} + L_{loc} + L_{fom}$

where $L_{rpn}$, $L_{cls}$ and $L_{loc}$ are the losses for RPN, classification and localization, respectively. $L_{fom}$ is the loss for FOM.

Among them, the loss function of FOM is as follows:

(6)   $L_{fom} = L_{ce}(\mathcal{H}(\mathcal{F}(x)), y)$

where $L_{ce}$ is the cross-entropy loss function, $\mathcal{F}$ is the feature extractor, $\mathcal{H}$ is a function for transforming features into category predictions, and $y$ is the logo category.

The cross-entropy classification loss is adopted as follows:

(7)   $L_{ce} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_{ic} \log(p_{ic})$

where $N$ is the number of training samples and $M$ is the number of food logo categories. $y_{ic}$ is the indicator variable: it is 1 if the category of sample $i$ is $c$, and 0 otherwise. $p_{ic}$ is the predicted probability that sample $i$ belongs to category $c$.
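Under the equal weighting implied by Eq. (5), the overall objective can be sketched as follows; the RPN and localization terms are taken as precomputed scalars here, which is an assumption for brevity.

```python
import torch.nn.functional as F

def mfdnet_loss(l_rpn, cls_logits, l_loc, fom_logits, labels):
    """Sketch of Eqs. (5)-(7): sum of the RPN, classification, localization
    and FOM losses; both classification terms use cross-entropy (Eq. 7)."""
    l_cls = F.cross_entropy(cls_logits, labels)  # Eq. (7) on the main head
    l_fom = F.cross_entropy(fom_logits, labels)  # Eq. (6) on the FOM branch
    return l_rpn + l_cls + l_loc + l_fom         # Eq. (5)
```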

5. Experiment

5.1. Experimental Setup

Figure 5. Visualization comparison between Double-Head R-CNN and MFDNet on FoodLogoDet-1500. The first row shows Double-Head R-CNN, the second row MFDNet. Green boxes: ground-truth boxes. Orange boxes: correct detection boxes. Yellow boxes: mistaken detection boxes.

Dataset and evaluation metrics.

To evaluate the effectiveness of the proposed MFDNet, we conduct extensive experiments on our introduced FoodLogoDet-1500 and two standard logo detection datasets, FlickrLogos-32 and QMUL-OpenLogo.

For evaluation, we adopt the widely used mean Average Precision (mAP) (Everingham et al., 2010) with an IoU threshold of 0.5, which means that a detection is considered positive if the IoU between the predicted box and the ground-truth box exceeds 50%. We also use AP25 and AP75 as evaluation standards, which correspond to IoU thresholds of 0.25 and 0.75, respectively.
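For reference, the IoU criterion underlying these metrics can be computed as below; boxes are assumed to be in (x1, y1, x2, y2) corner format.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format.
    A detection counts as positive for mAP when IoU exceeds 0.5."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```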

Implementation details. We implement our method based on the publicly available mmdetection toolbox (Chen et al., 2019). The Double-Head R-CNN based on ResNet-50 is adopted as the baseline network.

In our experiments, the base detection networks are trained with stochastic gradient descent (SGD). The input images are resized to 800 × 600 pixels. We train detectors end-to-end with 2 GPUs (2 images per GPU) for 12 epochs. A weight decay of 0.0001 and a momentum of 0.9 are used; other hyperparameters, including the initial learning rate, follow the settings in mmdetection unless otherwise specified.
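A hypothetical mmdetection-style config fragment matching this setup might look as follows; the learning-rate value is an assumption, since the exact figure is not reproduced in this text.

```python
# Hypothetical mmdetection 2.x config fragment for the training setup above.
optimizer = dict(
    type="SGD",
    lr=0.005,             # assumed value; not stated in this text
    momentum=0.9,         # as reported
    weight_decay=0.0001,  # as reported
)
runner = dict(type="EpochBasedRunner", max_epochs=12)  # 12 epochs
data = dict(samples_per_gpu=2)                         # 2 images per GPU
```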

FOM BFP mAP AP25 AP75
– – 84.5 84.4 82.1
✓ – 86.0 85.9 84.4
– ✓ 85.1 85.0 83.1
✓ ✓ 86.6 86.4 85.0
Table 2. Evaluation of individual modules and their combination in MFDNet on FoodLogoDet-1500 (%).
Method mAP AP25 AP75
Faster R-CNN (Ren et al., 2015) 83.9 83.8 81.7
RetinaNet (Lin et al., 2017b) 77.3 77.0 75.3
DCN (Dai et al., 2017) 85.2 85.1 84.2
Cascade R-CNN (Cai and Vasconcelos, 2018) 83.5 83.3 83.2
PANet (Liu et al., 2018) 83.8 83.6 81.5
Libra R-CNN (Pang et al., 2019) 77.8 77.7 76.4
FSAF (Zhu et al., 2019) 83.0 83.5 81.0
Dynamic R-CNN (Zhang et al., 2020a) 75.9 75.5 65.6
Sparse R-CNN (Sun et al., 2020) 83.1 - -
SABL (Wang et al., 2020a) 82.9 83.4 82.0
GRoIE (Rossi et al., 2020) 83.4 83.6 82.1
Generalized Focal Loss (Li et al., 2020) 79.2 79.1 78.5
Double-Head R-CNN (Wu et al., 2020) 84.5 84.4 82.1
ATSS (Zhang et al., 2020b) 80.2 80.0 79.8
FoveaBox (Kong et al., 2020) 75.2 75.0 74.0
Soft-NMS (Bodla et al., 2017) 83.8 83.9 81.8
OHEM (Shrivastava et al., 2016) 84.1 84.2 82.5
IoU loss (Yu et al., 2016) 82.2 82.1 79.5
Generalized IoU (Rezatofighi et al., 2019) 83.3 83.2 80.4
SSD (Liu et al., 2016) 80.4 80.1 78.6
MFDNet 86.6 86.4 85.0
Table 3. Performance comparison on FoodLogoDet-1500 (%).

5.2. Experiment on FoodLogoDet-1500

To enable benchmark research, we follow a standard setup for data partitioning in our experiments: 80% of the images in each food logo category are randomly selected for training and the remaining 20% for testing, as sketched below.
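A per-category random split of this kind can be implemented as follows; the (path, category) sample representation and the fixed seed are assumptions made for the illustration.

```python
import random
from collections import defaultdict

def split_per_category(samples, train_ratio=0.8, seed=0):
    """Randomly split samples 80/20 into train/test within each category.
    `samples` is an iterable of (image_path, category) pairs."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for path, cat in samples:
        by_cat[cat].append(path)
    train, test = [], []
    for cat, paths in by_cat.items():
        rng.shuffle(paths)
        cut = int(round(len(paths) * train_ratio))
        train += [(p, cat) for p in paths[:cut]]
        test += [(p, cat) for p in paths[cut:]]
    return train, test
```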

Ablation Study. For the ablation study, we conduct a comprehensive analysis of the effects of the two MFDNet modules on FoodLogoDet-1500. Table 2 shows the results for different combinations of FOM and BFP. Added to Double-Head R-CNN, FOM, BFP and their combination improve the mAP by 1.5%, 0.6% and 2.1%, respectively. These results prove the effectiveness of FOM on a large-scale food logo dataset.

Next, we visualize the ablation study and analyze the remaining problems of Double-Head R-CNN, providing typical examples in Fig. 5 that cover both the regression bounding boxes and the classification accuracy. As in Fig. 5, green boxes are ground-truth boxes, orange boxes correct detections, and yellow boxes mistaken detections. Clearly, MFDNet can accurately detect occluded, ambiguous and smaller objects, and it obtains more accurate bounding box regression and classification scores. Double-Head R-CNN makes some detection mistakes, such as misclassifying similar food logo categories and mistaking ordinary words for food logos; for example, the word ‘Good’ is detected as the food logo ‘coors’ because the two words look similar. These errors arise because the large number of food logo categories and the similarity between food logos are not taken into account. In contrast, for the detected logos in the middle two images in Fig. 5, our method has an advantage in classification accuracy and also detects smaller food logos. This shows that FOM can search for the best features for classification, and BFP can fuse multi-scale information for detection.

Figure 6. Detection results for our proposed MFDNet on the FoodLogoDet-1500. Orange boxes: correct detection boxes.

Comparisons with State-of-the-Arts. In this subsection, we compare the results of our method with other works on FoodLogoDet-1500. Table 3 summarizes the clear performance superiority of MFDNet over all state-of-the-art methods, with significant mAP, AP25 and AP75 improvements. SSD uses VGG-16 (Simonyan and Zisserman, 2015) as its backbone, while the other detection models adopt ResNet-50. Compared with existing baselines such as RetinaNet, Faster R-CNN and Double-Head R-CNN, the proposed method significantly outperforms these state-of-the-art methods. RetinaNet is limited in multi-scale object detection. Faster R-CNN adopts the same parameters for two different tasks, ignoring the conflict between them in the sibling head, especially for the classification of large-scale datasets. MFDNet achieves the best performance with 86.6% mAP. Compared with Double-Head R-CNN, our proposed method gains 2.1% mAP; specifically, it gains 2.0% AP25 and 2.9% AP75, showing that our method remains strong as the IoU threshold changes. MFDNet also surpasses the two-stage detectors Faster R-CNN, Sparse R-CNN and GRoIE, boosting the mAP by 2.7%, 3.5% and 3.2%, respectively. These results validate the advantage of our feature decoupling over existing methods. Some detection results of MFDNet are given in Fig. 6, covering both the regression bounding boxes and the classification accuracy; orange boxes represent predicted boxes.

We also vary the number of iterations to compare the convergence and accuracy of models. Fig. 7 shows increasing performance with more iterations. Our method converges at about 100,000 iterations and maintains higher accuracy than Double-Head R-CNN throughout training, which shows that feature decoupling can speed up model convergence.

5.3. Experiment on Other Benchmarks

Besides FoodLogoDet-1500, we also conduct evaluations on other publicly available benchmark datasets, QMUL-OpenLogo and FlickrLogos-32, to further verify the effectiveness of our method. QMUL-OpenLogo contains 27,083 images from 352 logo categories; in each logo category, 70% of images are randomly selected for training and 30% for testing (Su et al., 2018). FlickrLogos-32 consists of 2,240 images from 32 logo categories; 80% of images are randomly selected for training and 20% for testing in each logo category. Since the baselines on these datasets report only mAP, we also use mAP as the evaluation standard for comparison.

Experiments on QMUL-OpenLogo. We list the experimental results of the baselines and our proposed method in Table 4. Our proposed method achieves the best performance with 51.3% mAP. Specifically, MFDNet outperforms the baseline model by 0.4% mAP, and compared to the anchor-free method FSAF, it improves the mAP by 6.6%. These results demonstrate the universality of our method on a large-scale logo dataset.

Methods mAP
YOLO9000 (Redmon and Farhadi, 2017) 26.3
ATSS (Zhang et al., 2020b) 48.4
Faster R-CNN (Ren et al., 2015) 51.2
Libra R-CNN (Pang et al., 2019) 51.2
FSAF (Zhu et al., 2019) 44.7
Dynamic R-CNN (Zhang et al., 2020a) 51.2
FoveaBox (Kong et al., 2020) 35.6
Generalized Focal Loss (Li et al., 2020) 46.6
Sparse R-CNN (Sun et al., 2020) 46.9
Double-Head R-CNN (Wu et al., 2020) 50.9
MFDNet 51.3
Table 4. Performance comparison on QMUL-OpenLogo (%).
Figure 7. The comparison of MFDNet and Double-Head R-CNN with increasing iterations.

Experiments on FlickrLogos-32. To further prove the effectiveness of our method, we also carry out experiments on FlickrLogos-32, which has fewer images. Table 5 shows that MFDNet still achieves the best performance compared with other methods, surpassing Double-Head R-CNN by 0.9% mAP. However, our model achieves only a small margin of 0.1% mAP over the strongest competing method, Generalized Focal Loss. The probable reason is that FlickrLogos-32 contains fewer logo images, so the FOM module does not play a decisive role on this dataset.

Methods mAP
Bag of Words (BoW) (Romberg and Lienhart, 2013) 54.5
Deep Logo (Iandola et al., 2015) 74.4
BD-FRCN-M (Oliveira et al., 2016) 73.5
YOLO (Redmon et al., 2016) 68.7
YOLOv3 (Redmon and Farhadi, 2018) 71.7
RetinaNet (Lin et al., 2017b) 78.4
Faster R-CNN (Ren et al., 2015) 83.5
Libra R-CNN (Pang et al., 2019) 84.6
Dynamic R-CNN (Zhang et al., 2020a) 85.8
FoveaBox (Kong et al., 2020) 84.1
Generalized Focal Loss (Li et al., 2020) 86.2
Sparse R-CNN (Sun et al., 2020) 73.7
Double-Head R-CNN (Wu et al., 2020) 85.3
MFDNet 86.2
Table 5. Performance comparison on FlickrLogos-32 (%).

5.4. Discussion

Compared with existing methods, our proposed method obtains better detection performance, especially for small food logo objects and large-scale classification. However, it cannot achieve high detection performance in some cases. In Fig. 5, the fourth image in the second row shows that although MFDNet improves the detection accuracy of small food logos, there are still missed detections for smaller objects. Therefore, food logo detection on FoodLogoDet-1500 still poses great challenges, such as small food logo objects, which also highlights the comparative difficulty of FoodLogoDet-1500.

6. Conclusions

In this paper, we present a new large-scale dataset, FoodLogoDet-1500, which is, to the best of our knowledge, currently the first and largest publicly available food logo detection dataset. We hope FoodLogoDet-1500 will become a new benchmark dataset and facilitate future research on food logo detection. We then propose a Multi-scale Feature Decoupling Network for food logo detection. Extensive evaluation on FoodLogoDet-1500 and two other standard benchmark logo datasets has verified its effectiveness.

With the rapid development of e-commerce platforms and major food brands, food logo detection will become a trend of future research. We will continue to explore the characteristics of FoodLogoDet-1500 and construct different benchmarks to evaluate its challenges, such as tiny food logos, serious occlusion and low resolution. Furthermore, we will explore transformers (Carion et al., 2020) and lightweight methods to achieve faster and more accurate food logo detection.

Acknowledgements.
This work was supported in part by the National Natural Science Foundation of China (62072289, 61702313, and 61972378), in part by the Postdoctoral Science Foundation of China (2017M612338), and in part by the Shandong Science and Technology Plan Project (J17KB177).

References

  • S. Bianco, M. Buzzelli, D. Mazzini, and R. Schettini (2017) Deep learning for logo recognition. Neurocomputing 245, pp. 23–30. Cited by: Table 1.
  • A. Bochkovskiy, C. Wang, and H. M. Liao (2020) Yolov4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934. Cited by: §2.2.
  • N. Bodla, B. Singh, R. Chellappa, and L. S. Davis (2017) Soft-nms – improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5561–5569. Cited by: Table 3.
  • Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162. Cited by: Table 3.
  • N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In Proceeding of the European Conference on Computer Vision, pp. 213–229. Cited by: §2.2, §6.
  • K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, et al. (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §5.1.
  • C. Cortes and V. Vapnik (1995) Support-vector networks. Machine learning 20 (3), pp. 273–297. Cited by: §2.2.
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773. Cited by: Table 3.
  • C. Eggert, A. Winschel, and R. Lienhart (2015) On the benefit of synthetic data for company logo detection. In Proceedings of the ACM International Conference on Multimedia, pp. 1283–1286. Cited by: §1.
  • M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman (2010) The pascal visual object classes (VOC) challenge. International Journal of Computer Vision., pp. 303–338. Cited by: §2.1, §5.1.
  • I. Fehérvári and S. Appalaraju (2019) Scalable logo recognition using proxies. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 715–725. Cited by: Table 1.
  • Y. Gao, F. Wang, H. Luan, and T. Chua (2014) Brand data gathering from live social media streams. In Proceedings of the International Conference on Multimedia Retrieval, pp. 169–176. Cited by: §1.
  • R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448. Cited by: §2.2.
  • R. M. Haralick, K. Shanmugam, and I. H. Dinstein (1973) Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics (6), pp. 610–621. Cited by: §2.2.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969. Cited by: §2.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §4.
  • S. C. H. Hoi, X. Wu, H. Liu, Y. Wu, H. Wang, H. Xue, and Q. Wu (2015) LOGO-net: large-scale deep logo detection and brand recognition with deep region-based convolutional networks. IEEE Transactions on Pattern Analysis & Machine Intelligence 46 (5), pp. 2403–2412. Cited by: §2.1, Table 1.
  • F. N. Iandola, A. Shen, P. Gao, and K. Keutzer (2015) Deeplogo: hitting logo recognition with the deep neural network hammer. arXiv preprint arXiv:1510.02131. Cited by: §2.2, Table 5.
  • B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang (2018) Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision, pp. 784–799. Cited by: §2.2.
  • X. Jin, W. Su, R. Zhang, Y. He, and H. Xue (2020) The open brands dataset: unified brand detection and recognition at scale. In Proceeding of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4387–4391. Cited by: §2.1, Table 1.
  • A. Joly and O. Buisson (2009) Logo retrieval with a contrario visual query expansion. In Proceedings of the ACM International Conference on Multimedia, pp. 581–584. Cited by: Table 1.
  • Y. Kalantidis, L. G. Pueyo, M. Trevisiol, R. van Zwol, and Y. Avrithis (2011) Scalable triangulation-based logo recognition. In Proceedings of the ACM International Conference on Multimedia Retrieval, pp. 1–7. Cited by: §1, Table 1.
  • T. Kong, F. Sun, H. Liu, Y. Jiang, L. Li, and J. Shi (2020) Foveabox: beyound anchor-based object detection. IEEE Transactions on Image Processing 29, pp. 7389–7398. Cited by: Table 3, Table 4, Table 5.
  • H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision, pp. 734–750. Cited by: §2.2.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §2.2.
  • P. Letessier, O. Buisson, and A. Joly (2012) Scalable mining of small visual objects. In Proceedings of the ACM International Conference on Multimedia, pp. 599–608. Cited by: Table 1.
  • X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang (2020) Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. arXiv preprint arXiv:2006.04388. Cited by: Table 3, Table 4, Table 5.
  • Y. Liao, X. Lu, C. Zhang, Y. Wang, and Z. Tang (2017) Mutual enhancement for detection of multiple logos in sports videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4846–4855. Cited by: Table 1.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017a) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125. Cited by: §2.2.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017b) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988. Cited by: Table 3, Table 5.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Proceeding of the European Conference on Computer Vision, pp. 740–755. Cited by: §2.1.
  • S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia (2018) Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768. Cited by: §2.2, Table 3.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In Proceedings of the European Conference On Computer Vision, pp. 21–37. Cited by: §2.2, Table 3.
  • D. G. Lowe (1999) Object recognition from local scale-invariant features. In Proceedings of the seventh IEEE International Conference on Computer Vision, Vol. 2, pp. 1150–1157. Cited by: §2.2.
  • M. Mas, L. Qian, A. Jan, and E. J. Delp (2018) Logo detection and recognition with synthetic images. Electronic Imaging 2018 (10), pp. 3371–3377. Cited by: Table 1.
  • W. Min, S. Jiang, L. Liu, Y. Rui, and R. Jain (2019) A survey on food computing. ACM Computing Surveys 52 (5), pp. 1–36. Cited by: §1.
  • G. Oliveira, X. Frazão, A. Pimentel, and B. Ribeiro (2016) Automatic graphic logo detection via fast region-based convolutional networks. In 2016 International Joint Conference on Neural Networks, pp. 985–991. Cited by: Table 5.
  • J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin (2019) Libra r-cnn: towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 821–830. Cited by: §2.2, Table 3, Table 4, Table 5.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788. Cited by: §2.2, Table 5.
  • J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271. Cited by: §2.2, Table 4.
  • J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §2.2, Table 5.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497. Cited by: §2.2, Table 3, Table 4, Table 5.
  • J. Revaud, M. Douze, and C. Schmid (2012) Correlation-based burstiness for logo retrieval. In Proceedings of the ACM International Conference on Multimedia, pp. 965–968. Cited by: §1.
  • H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 658–666. Cited by: Table 3.
  • S. Romberg and R. Lienhart (2013) Bundle min-hashing for logo recognition. In Proceedings of the ACM Conference on International Conference on Multimedia Retrieval, pp. 113–120. Cited by: §1, Table 5.
  • S. Romberg, L. G. Pueyo, R. Lienhart, and R. Van Zwol (2011) Scalable logo recognition in real-world images. In Proceedings of the ACM International Conference on Multimedia Retrieval, pp. 1–8. Cited by: 1st item, §2.1, Table 1.
  • L. Rossi, A. Karimi, and A. Prati (2020) A novel region of interest extraction layer for instance segmentation. arXiv preprint arXiv:2004.13665. Cited by: Table 3.
  • H. Sahbi, L. Ballan, G. Serra, and A. Del Bimbo (2012) Context-dependent logo matching and recognition. IEEE Transactions on Image Processing 22 (3), pp. 1018–1031. Cited by: Table 1.
  • J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §2.2.
  • A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769. Cited by: Table 3.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, pp. 1–14. Cited by: §5.2.
  • G. Song, Y. Liu, and X. Wang (2020) Revisiting the sibling head in object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11563–11572. Cited by: §2.2.
  • H. Su, S. Gong, and X. Zhu (2017a) Weblogo-2m: scalable logo detection by deep learning from the web. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 270–279. Cited by: Table 1.
  • H. Su, X. Zhu, and S. Gong (2017b) Deep learning logo detection with data expansion by synthesising context. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 530–539. Cited by: §2.1, Table 1.
  • H. Su, X. Zhu, and S. Gong (2018) Open logo detection challenge. arXiv preprint arXiv:1807.01964. Cited by: 1st item, §2.1, Table 1, §5.3.
  • P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang, et al. (2020) Sparse r-cnn: end-to-end object detection with learnable proposals. arXiv preprint arXiv:2011.12450. Cited by: Table 3, Table 4, Table 5.
  • Z. Tian, C. Shen, H. Chen, and T. He (2019) Fcos: fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9627–9636. Cited by: §2.2.
  • A. Tüzkö, C. Herrmann, D. Manger, and J. Beyerer (2017) Open set logo detection and retrieval. arXiv preprint arXiv:1710.10891. Cited by: Table 1, §3.
  • J. Wang, W. Zhang, Y. Cao, K. Chen, J. Pang, T. Gong, J. Shi, C. C. Loy, and D. Lin (2020a) Side-aware boundary localization for more precise object detection. In Proceeding of the European Conference on Computer Vision, pp. 403–419. Cited by: Table 3.
  • J. Wang, W. Min, S. Hou, S. Ma, Y. Zheng, and S. Jiang (2020b) LogoDet-3k: a large-scale image dataset for logo detection. arXiv preprint arXiv:2008.05359. Cited by: §2.1, Table 1.
  • J. Wang, W. Min, S. Hou, S. Ma, Y. Zheng, H. Wang, and S. Jiang (2020c) Logo-2k+: a large-scale logo dataset for scalable logo classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 6194–6201. Cited by: §2.1, Table 1.
  • Y. Wu, Y. Chen, L. Yuan, Z. Liu, L. Wang, H. Li, and Y. Fu (2020) Rethinking classification and localization for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10186–10195. Cited by: §2.2, Table 3, Table 4, Table 5.
  • L. Xie, Q. Tian, W. Zhou, and B. Zhang (2014) Fast and accurate near-duplicate image search with affinity propagation on the imageweb. Computer Vision & Image Understanding 124, pp. 31–41. Cited by: Table 1.
  • J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang (2016) Unitbox: an advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, pp. 516–520. Cited by: Table 3.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, pp. 818–833. Cited by: §4.1.
  • H. Zhang, H. Chang, B. Ma, N. Wang, and X. Chen (2020a) Dynamic r-cnn: towards high quality object detection via dynamic training. In Proceedings of the European Conference on Computer Vision, pp. 260–275. Cited by: Table 3, Table 4, Table 5.
  • S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li (2020b) Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9759–9768. Cited by: Table 3, Table 4.
  • C. Zhu, Y. He, and M. Savvides (2019) Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 840–849. Cited by: Table 3, Table 4.