Large Scale Open-Set Deep Logo Detection

We present an open-set logo detection (OSLD) system, which can detect (localize and recognize) any number of unseen logo classes without re-training; it only requires a small set of canonical logo images for each logo class. We achieve this using a two-stage approach: (1) generic logo detection, to detect candidate logo regions in an image; (2) logo matching, which matches the detected logo regions to a set of canonical logo images to recognize them. We also introduce a 'simple deep metric learning' (SDML) framework that outperformed more complicated ensemble and attention models and boosted the logo matching accuracy. Furthermore, we constructed a new open-set logo detection dataset with thousands of logo classes, which we will release for research purposes. We demonstrate the effectiveness of OSLD on our dataset and on the standard Flickr-32 logo dataset, outperforming state-of-the-art open-set and closed-set logo detection methods by a large margin.


1 Introduction

Logo detection is the task of localizing and identifying brand logos in images (Figure 1), with practical applications such as brand protection, brand-aware product search and recommendation. It is a difficult task, even for humans, since it is usually hard to tell whether a graphics pattern or piece of text belongs to a logo or not without prior knowledge of the original logo. Well-known logos are usually easy to spot and identify using prior knowledge, while unknown logos may be hard to recognize.

Traditionally, logo detection is treated as a closed-set object detection problem, in which the system is trained for a predefined set of logo classes and can only recognize those classes at test time. It has been shown to work well for a small number of logo classes [8, 4], e.g., the 32 logo classes in the Flickr-32 logo dataset [21], given a sufficient number of training samples for each class. However, closed-set logo recognition has the following major shortcomings that make it unsuitable for large-scale, real-world logo detection.

  • It cannot recognize new logo classes without re-training, which is expensive.

  • It is not scalable to a realistic number of logo classes (the METU logo dataset [24] has ∼410K logo classes).

  • It is very expensive to gather and annotate a sufficient number of logo images to properly train the detection/recognition systems.

Figure 1: Our system can detect logos (localize with a bounding box and identify the brand) in images even when the logo class is not present in the training set (open-set logo detection). It only needs a set of canonical logo images for each logo class (Figure 3).

Open-set logo detection [25, 7] addresses these shortcomings. It can detect and recognize new logo classes without re-training. It is relatively scalable to a large number of logo classes. There is no need to annotate a large number of images for each and every new logo class; it is sufficient to provide only the canonical logo images (Figure 3) for the logo classes to be recognized. Therefore, open-set logo detection is much better suited to real-world logo detection scenarios, where there are a large number of brand logos which might change over time, and new logo classes might arrive continuously.

Motivated by this, we present an open-set logo detection system, inspired by the recent success of similar systems for face recognition [27] and person re-identification [29]. Logo detection is similar in that we can formulate it as a two-stage problem: generic logo detection, followed by logo matching, which is feasible as there are typically a small number of canonical logo images for each brand logo (Figure 3).

We have the following major contributions.

  • An end-to-end open-set logo detection system (OSLD) that outperformed previous open-set and closed-set methods by a large margin.

  • A ‘simple deep metric learning’ framework (SDML) that outperformed existing deep metric frameworks by a large margin (Section 4.1) and helped boost the performance of OSLD.

  • A new logo detection dataset with thousands of logo classes (Section 5), to be released for research purposes.

2 Related Work

Logo detection is a special case of object detection, which comprises object localization and recognition. Traditionally, logo detection is treated as a closed-set object detection problem, in which the system is trained on a small number of pre-defined logo classes. Early systems relied on hand-engineered local keypoint features [9, 11, 21, 20]. Recently, deep learning based methods have been dominant in closed-set logo detection [2, 6, 1, 16, 23, 22, 3]. They trained a Faster R-CNN [18] object detector, or used logo region proposals followed by classification with CNN features. These methods require a large amount of memory for a large number of output classes and cannot detect new logo classes without re-training. It is also costly to gather and annotate a sufficient amount of training data. In [22], the authors present a partial solution to the dataset construction problem. They start with a small amount of training data to train a model and progressively expand the dataset by running the model on web images and retaining the high-confidence detections (semi-supervised learning). This way, they built the WebLogo-2M dataset with 1.9M images, but with only 194 logo classes.

Open-set logo detection takes a two-stage approach: generic logo detection followed by logo matching using learned CNN embeddings. This is similar to the recent deep learning based face recognition [27] and person re-identification [29] approaches, which achieved impressive performance. In [25], Faster R-CNN is used to first localize candidate logo regions. Then, another CNN, pre-trained on ImageNet and fine-tuned on the logo dataset, is used to extract features from the candidate logo regions and match them to the database. In [7], the authors employed the Proxy-NCA metric learning approach to learn better embeddings for matching candidate logo regions to canonical logo images. Trained on a large dataset (295K product images with 2K logo classes, downloaded from the web), they demonstrated a substantial mAP improvement on the Flickr-32 dataset over previous works, without re-training or fine-tuning. Our framework (Figure 2) is similar, but with better metric learning and training strategies, outperforming the previous methods by a large margin.

3 Method

Inspired by the high performance of two-stage deep metric learning based approaches in face recognition and person re-identification, we take a two-stage approach to logo detection, as shown in Figure 2. The first stage, generic logo detection (Section 3.1), localizes candidate logo regions using a generic logo detector, which should have high recall but may have low precision. The second stage, logo matching (Section 3.2), matches the candidate logo regions to the set of all canonical logo images for all logo classes to be recognized. The matching module is trained to assign high scores to correct matches and low scores to false matches. The detected logo regions are labeled with the best matching logo class.

Whenever a new logo class needs to be recognized, it is sufficient to add its canonical logo images (their CNN representations) to the logo database; no further data collection, annotation, or training is required. This flexibility is very useful in real-world logo detection applications.

Figure 2: Our logo detection framework. Generic logo detector outputs candidate logo regions, which are cropped from the image and then matched to a set of canonical logo images using low dimensional CNN embeddings learned for matching logo images.

3.1 Generic Logo Detection

Generic logo detection localizes candidate logo regions with bounding boxes, without identifying their class. Hence, a binary logo detector with two outputs (logo, background) is sufficient. The detector should have high recall. Due to the ambiguity of the logo detection problem, we expect to detect false logo candidates, typically text, graphics or graphics+text regions, which are filtered out in the second stage when they cannot be matched to any canonical logo image in the database.

We used the RFBNet [13] object detector, due to its high speed and accuracy, with a VGG16 base network; the input resolution was chosen so that most of the images in our dataset have that or a lower resolution. It is a single-stage object detector based on SSD [14]. We trained the RFBNet on a logo dataset labeled with bounding boxes. The class labels are not needed, as we only need a binary (generic) logo detector. This makes the annotation much cheaper, compared to closed-set approaches, which also need the class label for each bounding box.

We tuned the anchor box sizes and aspect ratios of the RFBNet based on the bounding box statistics of the training set. We used a pre-trained VGG16 network: we first froze the base convolutional layers and trained only the additional layers with the Adam optimizer, then trained the whole network for 50 epochs with a decaying schedule, halving the learning rate twice during training. We kept the model with the best validation accuracy, measured by mean average precision (mAP).
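A minimal PyTorch-style sketch of this two-phase schedule is given below; it assumes a detector whose pre-trained VGG16 backbone is exposed as `model.base`, and the learning rates, milestones and phase-1 epoch count are placeholders, since the exact values are omitted above.

```python
# Sketch of the freeze-then-finetune schedule; model/criterion are placeholders.
import torch

def run_epoch(model, loader, criterion, opt, device):
    model.train()
    for images, targets in loader:
        opt.zero_grad()
        loss = criterion(model(images.to(device)), targets)
        loss.backward()
        opt.step()

def train_two_phase(model, loader, criterion, device="cuda"):
    model.to(device)

    # Phase 1: freeze the pre-trained VGG16 base, train only the added layers.
    for p in model.base.parameters():
        p.requires_grad = False
    head = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(head, lr=1e-4)          # placeholder lr
    for _ in range(5):                             # placeholder epoch count
        run_epoch(model, loader, criterion, opt, device)

    # Phase 2: unfreeze everything; halve the lr twice during 50 epochs.
    for p in model.base.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder lr
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[25, 40], gamma=0.5)
    for _ in range(50):
        run_epoch(model, loader, criterion, opt, device)
        sched.step()
```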

3.2 Logo Matching

The goal of logo matching is to match the candidate logo regions from the generic logo detection to the canonical logo images in the database, in order to identify the class of each logo region or discard it if it does not match any logo image with high confidence. In Figure 2, the generic logo detector localized four candidate logo regions, three of which are actual logo regions and one is false. The logo matching module takes the cropped candidate logo regions and tries to match each one to all the canonical logo images in the database, using their L2-normalized CNN embeddings and Euclidean or cosine distance. The best matching canonical logo class is assigned as the label ('Puma', 'Ferrari' in the figure) if the matching score is above a threshold. The non-logo regions are discarded, as they do not match any canonical logo image with a high score.
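A minimal sketch of this matching step is shown below, assuming an `embed` network producing the learned embeddings and a pre-computed database of canonical logo embeddings; the threshold value is illustrative.

```python
# Embed the cropped candidate regions, L2-normalize, and keep the
# best-scoring canonical logo if it clears the threshold.
import torch
import torch.nn.functional as F

@torch.no_grad()
def match_regions(crops, embed, db_embeddings, db_labels, threshold=0.7):
    """Return (label, score) per candidate region, or None for non-logos."""
    q = F.normalize(embed(crops), dim=1)   # (num_crops, d), unit length
    sims = q @ db_embeddings.t()           # cosine similarities to all logos
    scores, idx = sims.max(dim=1)          # best canonical logo per crop
    return [
        (db_labels[i], s) if s >= threshold else None
        for s, i in zip(scores.tolist(), idx.tolist())
    ]
```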

We explored and evaluated two similar approaches to logo matching.

  • Deep Metric Learning (DML). We explored and improved the standard triplet loss and binomial deviance loss [17] networks to learn CNN embeddings for matching cropped logo regions to canonical logo images, trained on annotated image pairs. Details are in Section 4.

  • Semi-Geometric Matching (SGM). Inspired by [19], we used a Siamese network followed by pixel-wise matching of feature maps and a CNN trained with cross entropy loss.

    In [19], a CNN architecture is proposed to estimate the geometric correspondence between two images. First, the images go through a base CNN to obtain low-resolution feature maps. The feature maps are matched by a 'correlation layer', which matches each pixel's features in the first image to all the pixel features of the second image, producing a correlation map for the two images. The idea is to find all possible matches between the two feature maps, even if there is some transformation, e.g., affine, between them. Finally, the correlation map is used to estimate the parameters of the geometric transformation between the two images, using a regression network.

    Our task in this work is to determine whether two logo images match or not; we do not need the parameters of the transformation. Therefore, we slightly modified the original network architecture of [19]. Instead of the regression network that estimates the parameters of the transformation, we used a simple 3-layer CNN (64 conv5, relu, 128 conv5, relu, 256 conv3, relu, linear) trained with cross entropy loss to decide whether the two logo images match or not, similar to binary classification. We trained the whole network on the same pairs/triplets as in DML. We named this method 'semi'-geometric matching, since it does not estimate the parameters of the geometric transformation as in the original work.
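A minimal sketch of the correlation layer of [19] and our match/no-match head is given below; the channel counts follow the 3-layer CNN described above, while the paddings and pooling are our own illustrative choices.

```python
# Correlation layer: match every location of f1 to every location of f2,
# then classify the correlation map as match / no-match.
import torch
import torch.nn as nn
import torch.nn.functional as F

def correlation_map(f1, f2):
    """f1, f2: (B, C, H, W) feature maps -> (B, H*W, H, W) correlation map."""
    b, c, h, w = f1.shape
    f1 = F.normalize(f1.view(b, c, h * w), dim=1)   # unit-norm per location
    f2 = F.normalize(f2.view(b, c, h * w), dim=1)
    corr = torch.bmm(f1.transpose(1, 2), f2)        # (B, HW, HW)
    return corr.view(b, h * w, h, w)

class MatchHead(nn.Module):
    """Binary match/no-match classifier on top of the correlation map."""
    def __init__(self, hw):  # hw = H*W of the feature maps
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(hw, 64, 5, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 5, padding=2), nn.ReLU(),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(256, 2),   # logits for cross entropy (match / no match)
        )

    def forward(self, f1, f2):
        return self.net(correlation_map(f1, f2))
```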

4 Deep Metric Learning

Metric learning for images aims to learn a low dimensional image embedding that maps similar images closer in the embedding space and dissimilar ones farther apart. Deep metric learning (DML) using CNN embeddings has achieved high performance in image retrieval [17, 12], face recognition [27], and person re-identification [29], to name a few. In this section, we present a few simple tricks that further improve existing deep metric learning methods and outperform more complicated ensemble or attention models, using only vanilla CNNs. We name our simple DML pipeline 'Simple Deep Metric Learning' (SDML). Then, we leverage this SDML framework to boost the logo matching accuracy.

Deep metric learning relies on positive/negative pairs or triplets of images to optimize an objective/loss function. Contrastive, triplet, and binomial deviance losses are commonly used loss functions, with many variations. Triplet and contrastive losses have discontinuous gradients; the binomial deviance loss [17, 15] is similar to the contrastive loss, but with continuous gradients. Here we consider the triplet and binomial deviance loss functions.

Let $(x_a, x_p, x_n)$ be a triplet of anchor, positive and negative images, and let $(f_a, f_p, f_n)$ be their CNN embeddings, e.g., the output of the last fully connected layer. Following [17, 15], we use the cosine similarity, $s(f_i, f_j)$, between the image embeddings:

$$s(f_i, f_j) = \frac{f_i \cdot f_j}{\|f_i\|\,\|f_j\|} \qquad (1)$$

The cosine similarity lies in $[-1, +1]$ and enforces an upper bound on the loss value, possibly helping the optimization process.

The triplet loss, $L_{\mathrm{triplet}}$, is defined as follows:

$$L_{\mathrm{triplet}} = \max(0,\; m - s_{ap} + s_{an}) \qquad (2)$$

where $m$ is the margin value, and $s_{ap}$ and $s_{an}$ are the cosine similarities between the anchor-positive and anchor-negative image embeddings.

Similarly, the binomial deviance loss (bindev), $L_{\mathrm{bindev}}$, is defined as follows:

$$L_{\mathrm{bindev}} = L_{+} + L_{-} \qquad (3)$$

$$L_{+} = \frac{1}{|P|} \sum_{(i,j) \in P} \log\!\left(1 + e^{-\alpha (s_{ij} - \beta)}\right) \qquad (4)$$

$$L_{-} = \frac{1}{|N|} \sum_{(i,j) \in N} \log\!\left(1 + e^{\alpha (s_{ij} - \beta)}\right) \qquad (5)$$

where $P$ and $N$ are the sets of positive and negative pairs, $s_{ij}$ is the cosine similarity of pair $(i, j)$, and $\alpha$ and $\beta$ are scaling and translation (margin) parameters. This is slightly different from the formulation in [17], which uses a cost parameter to balance the positive and negative pairs in the loss. We omit this term as we use a balanced sampling strategy, as explained in the next section.
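For concreteness, a minimal PyTorch sketch of Eqs. (1)-(5) follows; the margin, $\alpha$ and $\beta$ values are placeholders, since the exact values are not restated here.

```python
# Minimal implementations of the loss functions above; s_ap/s_an and
# s_pos/s_neg are cosine similarities computed over a batch.
import torch
import torch.nn.functional as F

def cosine_sim(fi, fj):                              # Eq. (1)
    return F.cosine_similarity(fi, fj, dim=1)

def triplet_loss(s_ap, s_an, margin=0.1):            # Eq. (2)
    return F.relu(margin - s_ap + s_an).mean()

def bindev_loss(s_pos, s_neg, alpha=2.0, beta=0.5):  # Eqs. (3)-(5)
    # softplus(x) = log(1 + e^x), giving continuous gradients.
    l_pos = F.softplus(-alpha * (s_pos - beta)).mean()
    l_neg = F.softplus(alpha * (s_neg - beta)).mean()
    return l_pos + l_neg
```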

4.1 Simple Deep Metric Learning (SDML)

We now present some simple tricks or best practices that we empirically found to work well in deep metric learning, which enable our simple deep metric learning (SDML) framework to outperform complicated DML models that employ ensembles and/or attention.

  • Sampling. Sampling good pairs/triplets is known to be very important in training deep metric networks [28]. Random sampling is simple, but leads to easy pairs which are not helpful in the learning process. Hard negative mining (HNM) is a commonly used solution in such cases. However, HNM is typically applied within a batch, which is sampled randomly. We recommend balanced random sampling followed by hard negative mining (BHNM): first sample $n$ anchor images, then a positive and a negative image for each (for a total of $3n$ images); then, for each positive pair in the batch, find the hardest negative pair (with the highest loss) in the batch, and update the negative pairs in the triplets (see the sketch after this list). This strategy produces an equal number of positive and negative pairs (hence there is no need for a balancing factor in the loss function, as in [17]) and makes better use of the available data.

  • Preprocessing and Data Augmentation. The input images to CNNs are typically resized to a square size, e.g., $224 \times 224$, for efficient mini-batch processing. Directly resizing to a square size may lead to large distortions when the aspect ratio is either large or small. In such cases, padding the input image to square size should be preferred to avoid distortion; this affects the performance significantly.

    For data augmentation, random resized crop and flipping are typically used. The parameters of random resized crop should be tuned carefully for the dataset, otherwise they will adversely affect the performance: very small-scale crops may end up covering only background or cause a large object-scale discrepancy between the training and test sets, and a wide range of aspect ratios will heavily distort the images.

  • Optimization. Stochastic gradient descent (SGD) and Adam are commonly used optimizers in DML. We recommend the Adam optimizer with a small initial learning rate and a decreasing learning rate scheduler, which we found to provide fast and stable training with good results. Larger learning rates may lead to unstable training and/or inferior performance; much smaller learning rates, e.g., as used in [17], need long training times and may get stuck at local minima.
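Below is a minimal sketch of the BHNM strategy referenced in the Sampling item above, assuming pre-computed, L2-normalized embeddings for a balanced batch; the class-label mask is an implementation detail we add so that a mined "hardest negative" never actually belongs to the anchor's class.

```python
# Balanced batch of n anchors, n positives, n negatives; replace each
# sampled negative with the hardest (most similar) in-batch negative.
import torch

def bhnm_triplets(anc, pos, neg, anc_labels, neg_labels):
    """anc, pos, neg: (n, d) L2-normalized embeddings."""
    sims = anc @ neg.t()                                  # (n, n) cosine sims
    same = anc_labels.unsqueeze(1) == neg_labels.unsqueeze(0)
    sims = sims.masked_fill(same, float("-inf"))          # forbid same-class
    hard = sims.argmax(dim=1)                             # hardest negative
    return anc, pos, neg[hard]
```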

4.2 SDML Results

Now, we demonstrate the effectiveness of our SDML method by comparing it with the state-of-the-art DML methods that employ complicated ensemble and/or attention models. We performed experiments on one of the standard deep metric learning datasets, CUB-200-2011 [26], which is considered to be the hardest of four standard metric learning datasets (CUB, SOP, Cars-196, DeepFashion) with the lowest retrieval accuracy [28, 17, 12].

Method Base CNN Image Size Emb Size R@1 mAP
BinDev [17] (PAMI 18) GoogleNet 224 512 58.9 -
A-BIER [17] (PAMI 18) GoogleNet 224 512 65.5 -
ABE [12] (ECCV 18) GoogleNet 224 512 70.6 -
SDML (triplet) GoogleNet 224 512 77.3 64.5
SDML (bindev) GoogleNet 224 512 78.6 67.9
Margin [28] (ICCV 17) ResNet50 256 128 63.9 -
CGD [10] (arxiv 19) ResNet50 224 128 67.6 -
BFE [5] (ICCV 19) ResNet50 256 1536 74.1 -
CGD [10] (arxiv 19) ResNet50 224 1536 76.8 -
CGD [10] (arxiv 19) SE-ResNet50 224 1536 79.2 -
SDML (bindev) ResNet50 224 128 81.7 74.8
SDML (bindev) DenseNet169 224 512 84.5 78.1
Table 1: Comparison of our SDML with state-of-the-art methods on the cropped CUB-200-2011 dataset. The methods are grouped by base CNN architecture and embedding size.

We first review the latest state-of-the-art methods on DML, and then compare our SDML with them. In [28], a distance weighted sampling strategy was proposed to select more informative training examples and was shown to improve retrieval. Opitz et al. [17] introduced the binomial deviance loss to deep metric learning as an alternative to the triplet loss. They also proposed a DML network, BIER/A-BIER, with an embedding ensemble, online gradient boosting and additional diversity loss functions. Later, Kim et al. [12] improved the ensemble model of BIER with attention, attention-based ensemble (ABE), to attend to different parts of the image, and a divergence loss to encourage diversity among the learners.

More recently, Dai et al. [5] proposed a simple batch feature erasing method (BFE), originally for person re-identification, and demonstrated significant improvement on DML datasets without using complicated ensembles or attention. Finally, Jun et al. [10] proposed a framework to combine multiple global descriptors (CGD) to obtain the benefits of ensemble methods, achieving the highest accuracy on all datasets. Currently, BFE and CGD are the best performing DML methods, but they use better base CNNs (ResNet50, SE-ResNet50 variants) and a larger embedding size (1536), which affect the performance significantly. For a fair comparison, the base CNNs and embedding sizes should be the same.

SDML Setting. We grouped the existing DML methods by base CNN and embedding size, and used the same settings for a fair comparison. We use ImageNet pre-trained base CNNs. We first fine-tuned the embedding layer for 10 epochs, and then fine-tuned the whole network for a maximum of 100 epochs, halving the initial learning rate four times, once every 10 epochs. We used the Adam optimizer, with fixed scaling ($\alpha$) and translation ($\beta$) parameters in the loss functions. Following common practice, we evaluated every 5 epochs and reported the best accuracy, which we observed to be repeatable over multiple runs.

We used balanced random sampling with hard negative mining (BHNM) as described above. We padded the input images with zeros to square size, and used random horizontal flips and random resized crops with scale in [0.8, 1.0] and aspect ratio in [0.9, 1.1] as data augmentation. At test time, we padded to square size with zeros, resized, and center-cropped (see the pipeline below).
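The preprocessing just described, written as a torchvision pipeline; `pad_to_square` is a small helper we define here, and the test-time resize size is a placeholder since the exact value is omitted above.

```python
import torchvision.transforms as T
import torchvision.transforms.functional as TF

def pad_to_square(img, fill=0):
    # Zero-pad a PIL image to a square canvas, centering the content.
    w, h = img.size
    side = max(w, h)
    left, top = (side - w) // 2, (side - h) // 2
    return TF.pad(img, [left, top, side - w - left, side - h - top], fill=fill)

train_tf = T.Compose([
    T.Lambda(pad_to_square),
    T.RandomResizedCrop(224, scale=(0.8, 1.0), ratio=(0.9, 1.1)),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

test_tf = T.Compose([
    T.Lambda(pad_to_square),
    T.Resize(256),        # placeholder resize size
    T.CenterCrop(224),
    T.ToTensor(),
])
```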

Comparison. Table 1 compares our SDML with the state-of-the-art DML methods reviewed above, using the commonly used Recall@K performance measure (percentage of queries with at least one relevant result in top K). We also reported mean average precision (mAP) values for SDML, as it is a better performance metric for ranked image retrieval. The results are grouped and sorted by base CNN and embedding size.
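As a reference for how the reported measure is computed, a small sketch of Recall@K follows, assuming a query-by-database cosine-similarity matrix in which same-class items count as relevant.

```python
# Recall@K: fraction of queries whose top-K results contain at least one
# same-class item.
import torch

def recall_at_k(sims, query_labels, db_labels, k=1):
    topk = sims.topk(k, dim=1).indices                        # (Q, k)
    hits = (db_labels[topk] == query_labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```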

With GoogleNet and embedding size 512, the baseline binomial deviance loss as presented in [17] reported 58.9 R@1, and the earlier best was 70.6 R@1 by ABE [12]. Our SDML achieved 78.6 with binomial deviance loss and 77.3 with triplet loss. This is 19.7 points higher than the original binomial deviance baseline, and 8.0 points higher than the earlier state-of-the-art, ABE. BinDev and triplet loss achieved similar R@1, but bindev achieved a better mAP.

With ResNet50 and embedding size 128, the earlier best was 67.6 R@1 by CGD [10]; our SDML achieved 81.7 with binomial deviance loss (triplet loss is similar), 14.1 points higher. This is even higher than that of BFE and CGD with embedding size 1536. Our best R@1 is 84.5 with a DenseNet169 having an additional fully connected embedding layer, using binomial deviance loss. Overall, our SDML outperforms the state-of-the-art ensemble and attention models by a large margin using simple CNN models.

Ablation Study. We performed ablation studies to find out the impact of sampling, preprocessing, data augmentation and optimization in our SDML framework, and present the results in Appendix B. Based on these results, the largest contribution comes from balanced sampling with hard negative mining (BHNM), resulting in an improvement of +15 points in R@1 over the binomial deviance loss baseline in [17]. With random sampling (without HNM), the accuracy is significantly lower and the last convolutional layer features work better than the embedding layer features; with hard negative mining, the embedding layer features are slightly better. The optimizer and preprocessing also provide modest improvements. We provide a more detailed analysis in the Appendix.

Next, we return to our original problem of open-set logo detection and employ our findings in SDML to boost logo matching accuracy.

5 Datasets

We need a large-scale dataset with a large number of logo classes to evaluate our logo detection framework. There is no such public dataset; therefore, we built our own dataset, BLAC, which will be released for research purposes.

BLAC (Brand Logos in the Amazon Catalog). The BLAC dataset consists of product images downloaded from Amazon (Figure 4). We downloaded 100K randomly selected images from the Amazon catalog and annotated them through Amazon Mechanical Turk. About 21.5K of these images contain one or more logos. The logo class of the main product in each catalog image was retrieved from the product page. We obtained two types of annotations to train our detection and matching networks:

  • Logo bounding boxes (bbox), without class labels, for all 21.5K images with 11K logo classes (hence, each logo class appears in only two images on average). We used this dataset to train and evaluate the generic logo detection. Note that it would not be possible to train a closed-set logo detection system on this dataset, as there are on average only two instances of each logo class.

  • From the 21.5K images, about 6.2K pairs of logo bounding boxes and canonical logo images, covering 2.8K logo classes.

Each image was annotated by multiple annotators, and the bbox coordinates were consolidated by taking the median values. The annotators also downloaded the canonical logo images for each logo class, which are required for logo matching. We used this dataset to train and evaluate logo matching. As with all real-world, crowd-sourced annotation efforts, and due to the ambiguity and difficulty of the annotation task, the annotations contain some errors, but they are not significant.

We split the dataset into training, validation and test sets in a strictly open-set setting, i.e., the training, validation and test logo classes are completely disjoint. For the logo detector, we used 16.5K images for training, 2K for validation and 3K for testing. For matching, we used 4.2K pairs with 2.4K classes for training, 500 pairs with 176 classes for validation and 1500 pairs with 248 classes for testing. Note that the test classes are never used in training/validation.

Flickr-32 Logos. This is a commonly used logo dataset with 32 logo classes [21]. Released in 2011, prior to the deep learning era, it has a tiny training set of 360 images; the test set has 3960 images, 960 of which contain logos. We used this dataset to compare with earlier logo detection methods and also to measure how transferrable our logo detection system is to other datasets.

As can be seen in Figure 4, the two datasets are quite different in terms of image content, clutter, logo size and transformations.

Figure 3: Canonical logo images. There might be significant differences between the versions of the same brand logo.

BLAC dataset

Flickr-32 dataset

Figure 4: Sample images from the BLAC and Flickr-32 logo datasets.

6 Results

6.1 Generic Logo Detection

We trained and evaluated the generic logo detection network (RFBNet) on the BLAC detection training set (16.5K images), as described in Section 3.1. We evaluated the performance using the standard PASCAL VOC object detection evaluation procedure, with mean average precision (mAP) and recall as the performance measures: a detection is correct if it has at least 50% intersection over union (IoU) with a ground truth bounding box. We evaluated on the BLAC test set and also on the Flickr-32 test set; Table 2 shows the results. On the BLAC test set, mAP is 71.0% and recall is 91.2%, which are very good considering the quality of the annotations and the ambiguity of the task.
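For clarity, the 50% IoU criterion can be written as a small helper (boxes given as corner coordinates):

```python
# Intersection-over-union for axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_correct(det, gt):
    return iou(det, gt) >= 0.5   # plus a class-label check for end-to-end eval
```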

Using the RFBNet model trained on the BLAC training set, the mAP on the Flickr-32 test set is 34.7%, which is much lower than that on the BLAC test set, but recall (77.5%) is still quite good. We also fine-tuned the BLAC RFBNet model on the tiny Flickr-32 training set (with only 360 images); the mAP improved to 49.3% and recall to 83.5%. Based on these results, we conclude that there is a significant difference between the two datasets in terms of generic logo detection.

Dataset mAP Recall
BLAC 71.0 91.2
Flickr-32 34.7 77.5
Flickr-32 (fine-tuned) 49.3 83.5
Table 2: Generic logo detection (localization) performance with RFBNet [13] using the PASCAL VOC 50% IoU criterion, trained on the BLAC detection training set. The fine-tuned row uses RFBNet fine-tuned on the Flickr-32 training set (360 images).

6.2 End-to-End Logo Detection

Here, we present the performance of end-to-end detection: given an input image, RFBNet localizes candidate logo regions, which are matched to the canonical logo images using SDML or SGM, as shown in Figure 2. For each candidate logo region, we consider only the top-1 match returned by logo matching. We use the standard PASCAL VOC object detection evaluation procedure with the mAP measure: a detection is correct if it has at least 50% IoU with a ground truth bounding box and its label is correct. We also report image-based mAP to compare with the earlier methods, all of which used image-based mAP: a detection is correct if its label matches one of the ground truth labels in the image, and for each logo class we take the detection with the maximum score (a single detection per logo class).

Note that the evaluations on both BLAC and Flickr-32 are done in a strictly open-set setting, i.e., the test set logo classes were never seen during training.

Based on the SDML experiments, we selected the DenseNet169 model with one extra FC layer, an embedding size of 512, and an input image size of 224. To account for the color variations of the logos, we additionally used color jitter (hue=0.1, brightness=0.1, contrast=0.1, saturation=0.1) and random grayscale conversion as data augmentation. Moreover, to account for the false positive logo candidates during matching, we included the false positives from generic logo detection as negative pairs during triplet sampling. Other training parameters are the same.

Dataset Augmentation. The BLAC training set for logo matching has 4.2K annotated pairs. Larger training sets can potentially improve the matching accuracy, but manual annotation is expensive. As a remedy, we employed semi-supervised learning (pseudo-labeling) to expand the dataset, as sketched below. We trained an initial model (with binomial deviance loss) on the available training data, then ran this model on unlabeled logo images and retained the matching pairs with a high matching score (cosine similarity above a threshold). With this, the augmented training set grew to 6.5K pairs.
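A minimal sketch of this pseudo-labeling step, reusing the matching ingredients from Section 3.2; the similarity threshold is a placeholder, since the exact value is omitted above.

```python
# Run the initial matching model on unlabeled crops and keep pairs whose
# cosine similarity clears a threshold.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mine_pairs(embed, unlabeled_crops, db_embeddings, db_labels, thresh=0.8):
    q = F.normalize(embed(unlabeled_crops), dim=1)
    scores, idx = (q @ db_embeddings.t()).max(dim=1)
    return [
        (crop_id, db_labels[i])          # high-confidence pseudo-labeled pair
        for crop_id, (s, i) in enumerate(zip(scores.tolist(), idx.tolist()))
        if s > thresh
    ]
```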

Results on BLAC. The end-to-end detection and matching performance on the BLAC test set (1500 images, 248 logo classes) for different matching methods is shown in Table 3. SDML outperformed SGM by a large margin in both image-based and bbox-based evaluations, leading to the conclusion that metric learning is better at learning embeddings for image matching; it is also much more efficient at test time (SGM needs to run the matching network against the CNN feature maps of all database images, while the DML network runs only on the test image and performs a nearest neighbor search against the pre-computed CNN embeddings of the database images).

Binomial deviance loss and triplet loss performed similarly. The SGM+ and SDML+ models were trained on the augmented training set (6.5K pairs), which provided a significant performance boost for all methods, indicating that there is still room for improvement with larger training sets. Overall, SDML+ with triplet loss achieved an image-based mAP of 84.3% and a bbox-based mAP of 59.5%.

Method Image-based BBox-based
SGM 64.7 47.0
SGM+ 70.5 51.0
SDML (triplet) 79.0 55.4
SDML+ (triplet) 84.3 59.5
SDML (bindev) 79.8 56.3
SDML+ (bindev) 84.3 59.0
Table 3: End-to-end bbox and image-based detection performance (mAP) on the BLAC test set. SGM: semi-geometric matching. SDML: simple deep metric learning. SGM+/SDML+: SGM/SDML trained on the augmented training set.
Method Training Data Image-based BBox-based
Closed-Set Fast R-CNN [16] Flickr-32 73.5 -
Faster R-CNN [23] Flickr-32 + Synthetic 81.1 -
Faster R-CNN [1] Flickr-32 84.2 -
Open-Set Proxy-NCA (top-1) [7] PL2K 44.4 -
Faster R-CNN [25] Flickr-32 + LitW 46.4 -
Proxy-NCA (top-5) [7] PL2K 56.6 -
SDML+ (triplet) BLAC 89.6 51.5
SDML+ (bindev) BLAC 88.9 50.4
SDML+ (triplet, fine-tuned detector) BLAC 90.9 64.7
SDML+ (bindev, fine-tuned detector) BLAC 90.5 63.7
Table 4: Comparison of our logo detection method (SDML+) with state-of-the-art open-set and closed-set logo detection methods. All values are mAP. The fine-tuned rows use our RFBNet generic logo detector fine-tuned on the Flickr-32 training set.

Results and Comparison on Flickr-32. We performed experiments on the Flickr-32 test set (3960 images, 32 logo classes), using the models trained on the BLAC augmented training set (SDML+), and compared with the state-of-the-art open-set and closed-set logo detection methods. There is no overlap between the logo classes of the BLAC training set and the Flickr-32 test set (open-set setting). The results and comparison are shown in Table 4. We report both image-based and bbox-based mAP; the earlier methods reported only image-based mAP values. We used the last convolutional layer features in logo matching, which worked slightly better than the embedding layer features.

We compared with two existing open-set logo detection approaches [25, 7] and three closed-set approaches [16, 23, 1], all of which are based on the Fast/er R-CNN object detector. Some of these methods trained their models on their own datasets or on an expanded version of Flickr-32, since the Flickr-32 training set is tiny (only 360 images), making it difficult to train deep networks. Hence, there is some discrepancy among the earlier methods in both the training datasets and the networks used. This is mainly due to the lack of a publicly available large-scale benchmark dataset for logo detection; our dataset will fill this gap.

Without any fine-tuning on the Flickr-32 dataset, our SDML+ with triplet loss achieved an image-based mAP of 89.6%, which outperformed the existing best open-set method [7] by a large margin (33.0 points). It even outperformed the best closed-set method by 5.4 points, clearly demonstrating its effectiveness.

When we fine-tune the RFBNet generic logo detector on the small Flickr-32 training set, the image-based mAP improves to 90.9% (+1.3 points), and the bbox-based mAP improves from 51.5% to 64.7% (+13.2 points). This indicates that generic logo detection is less transferrable between the two datasets, while logo matching transfers much better. Moreover, in contrast to the BLAC test results, the last convolutional layer features worked slightly better than the embedding layer features on the Flickr-32 test set, indicating that they are more transferrable across the two datasets. We could not fine-tune SDML+ on Flickr-32 for matching, as it does not have labeled pairs.

We presented end-to-end logo detection samples from BLAC and Flickr-32 test sets in Appendix A. Our detector is able to detect and recognize transformed logos, even though the canonical logos are upright. The matching network is able to learn how to match transformed logo instances with the help of such samples in the training set. We also experimented with additional data augmentations (random rotations, affine transformations), but did not observe performance improvement.

Discussion. The major performance boost in our logo detector comes from the SDML+ training and matching. The earlier open-set logo detection method in [25] used Faster R-CNN for detection and the outputs of a classification network for matching, which is not optimal. The Proxy-NCA method [7] used Faster R-CNN for detection and Proxy-NCA metric learning for matching; they trained their networks on a very large dataset of product images (296K images, 2K logo classes), which resulted in a substantial mAP lift over [25].

Our method is most similar to [7], which was also trained on a dataset of product images downloaded from Amazon, but one 14 times larger than ours (296K vs 21.5K images); their Faster R-CNN mAP and recall are similar to ours, indicating that the major performance boost in our logo detector indeed comes from the SDML+ training and matching.

As for the closed-set logo detectors [16, 23, 1], which all used the Fast/er R-CNN object detector, the major limitation seems to be Flickr-32's small training set. These methods are not scalable to a large number of classes (requiring a large amount of memory during training), even given a sufficient amount of training data for each class, which is very expensive to collect. Moreover, they cannot recognize new logo classes not seen during training.

7 Conclusions

We presented a two-stage open-set logo detection system (OSLD) that can recognize new logo classes without re-training. We constructed a new logo detection dataset (BLAC) with thousands of logo classes; it will be released for research purposes. We evaluated OSLD on this dataset and on the standard Flickr-32 dataset, demonstrating good generalization to unseen logo classes and outperforming both open-set and closed-set logo detection methods by a large margin. The performance boost was largely due to our 'simple deep metric learning' framework (SDML), which outperformed state-of-the-art DML methods with complicated ensemble and attention models on the standard CUB dataset.

We have also shown that semi-supervised learning is fairly effective for further improving performance. There is still ample room for improvement in the bbox-based metrics, especially with better and more transferrable generic logo detectors, larger and higher quality training sets, and semi-supervised learning.

References

  • [1] Y. Bao, H. Li, X. Fan, R. Liu, and Q. Jia (2016) Region-based CNN for Logo Detection. In International Conference on Internet Multimedia Computing and Service, pp. 319–322. Cited by: §2, §6.2, §6.2, Table 4.
  • [2] S. Bianco, M. Buzzelli, D. Mazzini, and R. Schettini (2015) Logo Recognition Using CNN Features. In Image Analysis and Processing, pp. 438–448. Cited by: §2.
  • [3] S. Bianco, M. Buzzelli, D. Mazzini, and R. Schettini (2017) Deep Learning for Logo Recognition. Neurocomputing 245, pp. 23–30. Cited by: §2.
  • [4] S. Bianco, M. Buzzelli, D. Mazzini, and R. Schettini (2017) Deep Learning for Logo Recognition. Neurocomputing 245 (C), pp. 23–30. Cited by: §1.
  • [5] Z. Dai, M. Chen, S. Zhu, and P. Tan (2019) Batch Feature Erasing for Person Re-identification and Beyond. In International Conference on Computer Vision. Cited by: §4.2, Table 1.
  • [6] C. Eggert, A. Winschel, and R. Lienhart (2015) On the benefit of synthetic data for company logo detection. In ACM International Conference on Multimedia, pp. 1283–1286. Cited by: §2.
  • [7] I. Fehervari and S. Appalaraju (2019) Scalable Logo Recognition using Proxies. In IEEE Winter Conference on Applications of Computer Vision, pp. 715–725. Cited by: §1, §2, §6.2, §6.2, §6.2, §6.2, Table 4.
  • [8] F. N. Iandola, A. Shen, P. Gao, and K. Keutzer (2015) DeepLogo: Hitting Logo Recognition with the Deep Neural Network Hammer. arXiv:1510.02131. Cited by: §1.
  • [9] A. Joly and O. Buisson (2009) Logo retrieval with a contrario visual query expansion. In ACM International Conference on Multimedia, pp. 581–584. Cited by: §2.
  • [10] H. Jun, B. Ko, Y. Kim, I. Kim, and J. Kim (2019) Combination of Multiple Global Descriptors for Image Retrieval. arXiv preprint arXiv:1903.10663. Cited by: §4.2, §4.2, Table 1.
  • [11] Y. Kalantidis, L. G. Pueyo, M. Trevisiol, R. van Zwol, and Y. Avrithis (2011) Scalable triangulation-based logo recognition. In ACM International Conference on Multimedia Retrieval, pp. 1–7. Cited by: §2.
  • [12] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon (2018) Attention-based Ensemble for Deep Metric Learning. In European Conference on Computer Vision, Cited by: §4.2, §4.2, §4.2, Table 1, §4.
  • [13] S. Liu, D. Huang, et al. (2018) Receptive Field Block Net for Accurate and Fast Object Detection. In European Conference on Computer Vision, pp. 385–400. Cited by: §3.1, Table 2.
  • [14] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: Single Shot Multibox Detector. In European conference on computer vision, pp. 21–37. Cited by: §3.1.
  • [15] D. Manandhar, M. Bastan, and K. Yap (2019) Semantic Granularity Metric Learning for Visual Search. arXiv preprint arXiv:1911.06047. Cited by: §4, §4.
  • [16] G. Oliveira, X. Frazão, A. Pimentel, and B. Ribeiro (2016) Automatic Graphic Logo Detection via Fast Region-based Convolutional Networks. In IEEE International Joint Conference on Neural Networks, Cited by: §2, §6.2, §6.2, Table 4.
  • [17] M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2018) Deep Metric Learning with BIER: Boosting Independent Embeddings Robustly. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Table 5, Appendix B, Appendix B, 1st item, 1st item, 3rd item, §4.2, §4.2, §4.2, §4.2, Table 1, §4, §4, §4, §4.
  • [18] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems, pp. 91–99. Cited by: §2.
  • [19] I. Rocco, R. Arandjelovic, and J. Sivic (2017) Convolutional Neural Network Architecture for Geometric Matching. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 39–48. Cited by: 2nd item.
  • [20] S. Romberg and R. Lienhart (2013) Bundle min-hashing for logo recognition. In ACM Conference on International Conference on Multimedia Retrieval, pp. 113–120. Cited by: §2.
  • [21] S. Romberg, L. G. Pueyo, R. Lienhart, and R. Van Zwol (2011) Scalable Logo Recognition in Real-world Images. In ACM International Conference on Multimedia Retrieval, pp. 25. Cited by: §1, §2, §5.
  • [22] H. Su, S. Gong, and X. Zhu (2017) WebLogo-2M: Scalable Logo Detection by Deep Learning from the Web. In IEEE International Conference on Computer Vision, pp. 270–279. Cited by: §2.
  • [23] H. Su, X. Zhu, and S. Gong (2017) Deep Learning Logo Detection with Data Expansion by Synthesising Context. In IEEE Winter Conference on Applications of Computer Vision, pp. 530–539. Cited by: §2, §6.2, §6.2, Table 4.
  • [24] O. Tursun and S. Kalkan (2015) METU Dataset: A Big Dataset for Benchmarking Trademark Retrieval. In International Conference on Machine Vision Applications, pp. 514–517. Cited by: 2nd item.
  • [25] A. Tüzkö, C. Herrmann, D. Manger, and J. Beyerer (2018) Open Set Logo Detection and Retrieval. In International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Cited by: §1, §2, §6.2, §6.2, Table 4.
  • [26] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Cited by: §4.2.
  • [27] M. Wang and W. Deng (2018) Deep Face Recognition: A Survey. arXiv preprint arXiv:1804.06655. Cited by: §1, §2, §4.
  • [28] C. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl (2017) Sampling Matters in Deep Embedding Learning. In IEEE International Conference on Computer Vision, Cited by: 1st item, §4.2, §4.2, Table 1.
  • [29] D. Wu, S. Zheng, X. Zhang, C. Yuan, F. Cheng, Y. Zhao, Y. Lin, Z. Zhao, Y. Jiang, and D. Huang (2019) Deep Learning-based Methods for Person Re-identification: A Comprehensive Review. Neurocomputing. Cited by: §1, §2, §4.

Appendix A Qualitative Results

Figure 5: End-to-end logo detection examples on the BLAC test set (SDML+ with binomial deviance loss), showing ground truth vs. detections. The numbers show the matching confidence (cosine similarity) out of 100.
Figure 6: End-to-end logo detection examples on the Flickr-32 test set (SDML+ with binomial deviance loss), showing ground truth vs. detections. The numbers show the matching confidence (cosine similarity) out of 100.
Configuration R@1
Baseline: Binomial Deviance Loss [17] (PAMI 18) 58.9
A1. Baseline + BHNM + padding, random crop 73.9
A2. Baseline + BHNM + padding, random crop + optimizer 77.2
A3. Baseline + BHNM + no padding, random crop + optimizer 69.9
A4. Baseline + BHNM + padding, random crop + optimizer 78.0
A5. Baseline + BHNM + padding, random resized crop + optimizer 78.6
Table 5: Ablation study for SDML, with GoogleNet base CNN, input image size 224, embedding size 512, and binomial deviance loss, on the cropped CUB-200-2011 dataset. BHNM: balanced sampling and hard negative mining (Section 4.1).

Appendix B SDML Ablation Study

Here we present the results and analysis of ablation studies to determine the impact of sampling, preprocessing and optimization in our 'simple deep metric learning (SDML)' framework. The results are shown in Table 5. We start with the original binomial deviance loss in [17] as our baseline (first row in the table, taken from [17]), and continue with our formulation of the binomial deviance loss (Section 4) in the remaining experiments.

When we use balanced sampling and hard negative mining (BHNM), the R@1 accuracy improves from 58.9 to 73.9, a gain of 15 points (experiment A1). BHNM has the highest positive impact on accuracy in our SDML framework.

The baseline in [17] uses the Adam optimizer with a very low learning rate, which takes very long to converge; we ran the experiment in A1 for 600 epochs to obtain this accuracy, with room for improvement with more training. In experiment A2, we use our optimization setting with larger learning rates and a learning rate scheduler, as described in Section 4.1. With this setting, the accuracy improves from 73.9 to 77.2 with only 60 epochs of training (10 epochs for stage 1, 50 epochs for stage 2), compared to 600 in A1.

As for preprocessing and data augmentation, padding to square size before resizing has a significant impact of 8.1 points (experiments A3 vs A4). Random resized crop (with scale in [0.8, 1.0] and aspect ratio in [0.9, 1.1]) is only slightly better than random crop (first resize, then random crop).