Logo detection in unconstrained scene images is crucial for a variety of real-world vision applications, such as brand trend prediction for commercial research and vehicle logo recognition for intelligent transportation [Romberg et al.(2011)Romberg, Pueyo, Lienhart, and Van Zwol, Romberg and Lienhart(2013), Pan et al.(2013)Pan, Yan, Xu, Sun, Shao, and Wu]. It is inherently a challenging task due to the presence of logo instances of varying sizes in arbitrary scenes with uncontrolled illumination, multiple resolutions, occlusion and background clutter (Fig. 1(c)). Existing logo detection methods typically consider a small number of logo classes and require large training sets annotated with object bounding boxes [Joly and Buisson(2009), Kalantidis et al.(2011)Kalantidis, Pueyo, Trevisiol, van Zwol, and Avrithis, Romberg et al.(2011)Romberg, Pueyo, Lienhart, and Van Zwol, Revaud et al.(2012)Revaud, Douze, and Schmid, Romberg and Lienhart(2013), Boia et al.(2014)Boia, Bandrabur, and Florea, Li et al.(2014)Li, Chen, Su, Duh, Zhang, and Li, Pan et al.(2013)Pan, Yan, Xu, Sun, Shao, and Wu, Liao et al.(2017)Liao, Lu, Zhang, Wang, and Tang]. Whilst this controlled setting allows for a straightforward adoption of state-of-the-art general object detection models [Ren et al.(2015)Ren, He, Girshick, and Sun, Girshick(2015), Redmon and Farhadi(2017)], it does not scale to dynamic real-world logo detection applications, where new logo classes become of interest during model deployment with only their clean design images available (Fig. 1(a)). To satisfy such incremental demands, prior methods are significantly limited by the extremely high cost of labelling a large set of per-class training logo images [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, et al.]. Whilst this requirement matters for practical deployments, it is ignored in existing logo detection benchmarks, which consider only unscalable fully supervised learning evaluations.
This work considers the realistic and challenging open-ended logo detection problem. To that end, we introduce a new Open Logo Detection setting, where fine-grained object bounding box annotation in real scene images is available for only a small proportion of logo classes (supervised), with the remaining classes (the majority) totally unlabelled (unsupervised). As a logo is a visual symbol, we have the clean logo designs for all target classes (Fig. 1(a)). The objective is to establish a logo detection model for all logo classes by exploiting the small labelled training set and the logo design images in a scalable manner. One approach to open logo detection is to jointly learn logo detection and classification as in YOLO9000 [Redmon and Farhadi(2017)], so that the model learns to detect logo objects from the labelled training images whilst learning to classify all logo design images. This method relies on robust object detection generalisation from labelled classes to unlabelled classes, and on rich appearance diversity of object instances. Both assumptions are invalid in our setting. Another alternative is synthesising training data [Su et al.(2017b)Su, Zhu, and Gong] by overlaying logo designs, with geometry/illumination variations, onto context images at random scales and locations. However, this introduces appearance inconsistency between the logo instances and the scene context (Fig. 3(a)), which may impede model generalisation.
In this work, we address the open logo detection challenge by presenting a Context Adversarial Learning (CAL) approach to automatically generate context-consistent synthetic training data. Specifically, CAL takes as input artificial images with superimposed logo designs [Su et al.(2017b)Su, Zhu, and Gong], and outputs corresponding images with context-consistent logo instances. This is a pixel prediction process, which we formulate as an image-to-image translation problem in the Generative Adversarial Network framework [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio].
Our contributions are: (1) We scale up logo detection to dynamic real-world applications without fine-grained labelled training data for newly arriving logo classes, and present a novel Open Logo Detection setting. This differs significantly from existing fully supervised logo detection problems, which require exhaustive object instance box labelling for all classes and hence have poor deployment scalability in reality. To our knowledge, this is the first attempt to investigate such a scalable logo detection scenario in the literature. (2) We introduce a new QMUL-OpenLogo benchmark providing a standard test of open logo detection and facilitating like-for-like comparative evaluation in future studies. QMUL-OpenLogo is created from 7 publicly available logo detection datasets through a careful logo class merging and filtering process, along with a benchmarking evaluation protocol. (3) We propose a Context Adversarial Learning (CAL) approach to synthesising context-coherent training data, enabling effective learning of state-of-the-art object detection models to tackle the open logo detection challenge in a scalable manner. Importantly, CAL requires no exhaustive human labelling and is therefore generally applicable to any unsupervised logo classes. Experiments show the performance advantage of CAL for open logo detection on the QMUL-OpenLogo benchmark in comparison to the state-of-the-art approaches YOLO9000 [Redmon and Farhadi(2017)] and Synthetic Context Logo (SCL) [Su et al.(2017b)Su, Zhu, and Gong] in contemporary object detection frameworks.
2 Related Work
Logo Detection Traditional methods for logo detection rely on hand-crafted features and sliding-window-based localisation [Li et al.(2014)Li, Chen, Su, Duh, Zhang, and Li, Revaud et al.(2012)Revaud, Douze, and Schmid, Romberg and Lienhart(2013), Boia et al.(2014)Boia, Bandrabur, and Florea, Kalantidis et al.(2011)Kalantidis, Pueyo, Trevisiol, van Zwol, and Avrithis]. Recently, deep learning methods [Iandola et al.(2015)Iandola, Shen, Gao, and Keutzer, Hoi et al.(2015)Hoi, Wu, Liu, Wu, Wang, Xue, and Wu, Su et al.(2017b)Su, Zhu, and Gong, Su et al.(2017a)Su, Gong, and Zhu, Liao et al.(2017)Liao, Lu, Zhang, Wang, and Tang] have been proposed which use generic object detection models [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik, Ren et al.(2015)Ren, He, Girshick, and Sun, Girshick(2015), Redmon and Farhadi(2017)]. However, these methods do not scale to realistic large deployments due to the need for: (1) accurately labelled training data per logo class; (2) strong object-level bounding box annotations. One exception is [Su et al.(2017a)Su, Gong, and Zhu, Su et al.(2018)Su, Gong, and Zhu], where noisy web logo images are exploited without manual labelling of object instance boxes. This method requires a huge quantity of data to mine sufficient correct logo images, and is limited for less popular and newly introduced brand logos which may lack web data. Moreover, all the above-mentioned methods assume the availability of real training images for ensuring model generalisation. This further reduces their scalability and usability in real-world scenarios where many logo classes, such as newly introduced logos, have no training images from real scenes. In this work, we investigate this under-studied Open Logo Detection setting, where the majority of logo classes have no training data.
Synthetic Data There are previous attempts to exploit synthetic data for training deep CNN models. Peng et al. [Peng et al.(2015)Peng, Sun, Ali, and Saenko] used 3D CAD object models to generate 2D images by varying the projections and orientations, augmenting the training data in few-shot learning scenarios. This method is based on the R-CNN model [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik], whose proposal generation component is independent of classifier fine-tuning, which renders the correlation between objects and background context suboptimal. The work of [Su et al.(2015)Su, Qi, Li, and Guibas] used synthetic data rendered from 3D models against varying backgrounds to enhance the training images of a pose model. Su et al. [Su et al.(2017b)Su, Zhu, and Gong] similarly generated synthetic images by overlaying logo instances, with appearance changes, on random background images. Rather than randomly placing exemplar objects [Su et al.(2015)Su, Qi, Li, and Guibas, Su et al.(2017b)Su, Zhu, and Gong, Peng et al.(2015)Peng, Sun, Ali, and Saenko], Georgakis et al. [Georgakis et al.(2017)Georgakis, Mousavian, Berg, and Kosecka] performed object-scene compositing based on accurate scene segmentation, similar to [Gupta et al.(2016)Gupta, Vedaldi, and Zisserman] for text localisation. These existing works mostly aim to generate images with varying object appearance. In contrast, we consider the consistency between objects and the surrounding context for generating appearance-coherent synthetic images. Conceptually, our method is complementary to the aforementioned approaches when applied concurrently.
3 QMUL-OpenLogo: Open Logo Detection Benchmark
To enable performance testing of open logo detection, we need to establish a corresponding benchmark, which the literature lacks. To that end, it is necessary to collect a large number of logo classes to simulate real-world deployments at scale. Given the tedious process of logo class selection, image data collection and filtering, and fine-grained bounding box annotation [Su et al.(2012)Su, Deng, and Fei-Fei], we propose to re-exploit existing logo detection datasets.
Table 1: Statistics of the source logo detection datasets and the QMUL-OpenLogo benchmark.

| Dataset | Logos | Images | min–max (mean) Instances / Class | min–max (mean) Scale (%) |
| --- | --- | --- | --- | --- |
| FlickrLogos-27 [Kalantidis et al.(2011)Kalantidis, Pueyo, Trevisiol, van Zwol, and Avrithis] | 27 | 810 | 35–213 (80.52) | 0.0160–100.0 (19.56) |
| FlickrLogos-32 [Romberg et al.(2011)Romberg, Pueyo, Lienhart, and Van Zwol] | 32 | 2,240 | 73–204 (106.38) | 0.0200–99.09 (9.16) |
| Logo32plus [Bianco et al.(2017)Bianco, Buzzelli, Mazzini, and Schettini] | 32 | 7,830 | 132–576 (338.06) | 0.0190–100.0 (4.51) |
| BelgaLogos [Joly and Buisson(2009)] | 37 | 1,321 | 2–223 (57.08) | 0.0230–69.04 (0.91) |
| WebLogo-2M (Test) [Su et al.(2017a)Su, Gong, and Zhu] | 194 | 4,318 | 18–204 (40.63) | 0.0180–99.67 (7.69) |
| Logo-In-The-Wild [Tüzkö et al.(2017)Tüzkö, Herrmann, Manger, and Beyerer] | 1,196 | 9,393 | 1–1,080 (23.49) | 0.0007–95.91 (1.80) |
| SportsLogo [Liao et al.(2017)Liao, Lu, Zhang, Wang, and Tang] | 20 | 1,978 | 108–292 (152.25) | 0.0100–99.41 (9.89) |
| QMUL-OpenLogo | 352 | 27,083 | 10–1,902 (88.25) | 0.0014–100.0 (6.09) |
Source data selection To maximise the context richness of logo images, we assembled 7 existing publicly accessible logo detection datasets (Table 1), sourced from diverse domains, to establish the QMUL-OpenLogo evaluation benchmark. Together, these datasets present significant logo variations and therefore represent the true logo detection challenge encountered in real-world unconstrained deployments. We only used the test data of WebLogo-2M [Su et al.(2017a)Su, Gong, and Zhu], since its training data are noisy and lack the labelled object bounding boxes required for model performance evaluation.
Logo annotation and class refinement We need a consistent logo class definition in QMUL-OpenLogo, given that different definitions exist across the datasets. In particular, Logo-In-The-Wild (LITW) treats different logo variations of the same brand as distinct logo classes. For example, the Adidas trefoil and text logos are treated as two different classes in LITW but as one class in all other datasets. We adopted the latter, more common criterion by merging all fine-grained same-brand logo classes from LITW. We then combined all logo image data of the same logo class across all selected datasets. We also cleaned up erroneous annotations by removing those whose bounding box exceeds the whole image size and/or has obviously wrong box coordinates. To ensure that each logo class has sufficient test data, we further removed extremely small classes with fewer than 10 images. Moreover, we manually verified 1–3 random images per class and filtered out classes with incorrect labels on the selected images. These refinements result in a QMUL-OpenLogo dataset with 27,083 images of 352 logo classes (Table 1).
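The two refinement rules above (dropping invalid boxes, then dropping under-sized classes) can be sketched as follows. This is an illustrative sketch, not the authors' actual cleaning script; the data layout and the `clean_annotations` name are assumptions.

```python
# Sketch of the annotation cleaning described above (hypothetical data layout):
# 1) remove bounding boxes that exceed the image extent or have invalid coordinates;
# 2) remove classes left with fewer than 10 images.

def clean_annotations(images):
    """images: list of dicts {"cls": str, "size": (W, H), "boxes": [(x1, y1, x2, y2), ...]}"""
    kept = []
    for im in images:
        w, h = im["size"]
        boxes = [b for b in im["boxes"]
                 if 0 <= b[0] < b[2] <= w and 0 <= b[1] < b[3] <= h]
        if boxes:  # keep the image only if at least one valid box survives
            kept.append({**im, "boxes": boxes})
    # drop classes with fewer than 10 remaining images
    counts = {}
    for im in kept:
        counts[im["cls"]] = counts.get(im["cls"], 0) + 1
    return [im for im in kept if counts[im["cls"]] >= 10]
```

A class whose only annotations are out-of-bounds boxes thus disappears entirely, matching the filtering order implied in the text.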
Table 2: Train/val/test data partition of QMUL-OpenLogo.

| Type | Classes | Train Img | Val Img | Test Img | Logo Design Img |
| --- | --- | --- | --- | --- | --- |
| Supervised | 32 | 1,280 | 1,019 | 9,168 | 32 (1 per class) |
| Unsupervised | 320 | 0 | 1,562 | 14,054 | 320 (1 per class) |
| Total | 352 | 1,280 | 2,581 | 23,222 | 352 (1 per class) |
Train/val/test data partition To support model training and evaluation on the QMUL-OpenLogo dataset as a test benchmark, we standardise the train/val/test sets in two steps: (1) We split all logo classes into two disjoint groups: one supervised, with labelled bounding boxes in real images, and the other unsupervised, i.e. the open logo detection setting. In particular, we selected all 32 classes of the popular FlickrLogos-32 dataset [Romberg et al.(2011)Romberg, Pueyo, Lienhart, and Van Zwol] as the supervised classes, and the remaining 320 classes as unsupervised. (2) For each supervised class, we assigned the original trainval set of FlickrLogos-32 (40 images per class) as the train data. For open logo detection, no real training images are available for unsupervised classes. Among the remaining images, we further performed a random 10%/90% val/test split per class. The data split is summarised in Table 2.
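The partition protocol above can be sketched as follows, with made-up class names and a hypothetical `partition` helper; the real benchmark fixes the 32 FlickrLogos-32 classes as the supervised group rather than passing them as a parameter.

```python
# Sketch of the two-step partition: supervised classes contribute their 40
# trainval images to the train set; all remaining images per class are split
# 10%/90% into val/test at random. Names and data layout are illustrative.
import random

def partition(images_by_class, supervised_classes, val_frac=0.10, seed=0):
    rng = random.Random(seed)
    split = {"train": [], "val": [], "test": []}
    for cls, imgs in images_by_class.items():
        imgs = list(imgs)
        rng.shuffle(imgs)
        if cls in supervised_classes:
            split["train"] += imgs[:40]      # original trainval images become train data
            imgs = imgs[40:]
        n_val = max(1, round(val_frac * len(imgs)))
        split["val"] += imgs[:n_val]
        split["test"] += imgs[n_val:]
    return split
```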
Logo design images Similar to [Su et al.(2017b)Su, Zhu, and Gong], we obtained the clean logo design images from the Google Image Search by querying the corresponding logo class name. These images define the logo detection tasks, one for each logo class (Fig. 1 (a)).
Benchmark properties The QMUL-OpenLogo benchmark has three characteristics: (1) a highly imbalanced logo frequency distribution (“Instances/Class” in Table 1); (2) significant logo scale variation (“Scale” in Table 1); (3) rich scene context (Fig. 1(c)). These factors are essential for a benchmark that provides a true performance test of logo detection algorithms in real-world large-scale deployments.
4 Synthesising Images by Context Adversarial Learning
In open logo detection, training images exist only for a small number of supervised classes, whilst no training data exist for the unsupervised classes (Table 2). To enable training of state-of-the-art detection models [Redmon and Farhadi(2017), Ren et al.(2015)Ren, He, Girshick, and Sun, Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, and Reed], we exploit the potential of synthetic training data. To this end, we propose a Context Adversarial Learning (CAL) approach for rendering logo instances against the scene context and improving their appearance consistency.
4.1 Context Adversarial Learning
The proposed CAL takes as input initial synthetic images with logo objects and generates the corresponding context-consistent synthetic images. These output images serve as additional training data for enhancing the generalisation of state-of-the-art detection models on real-world unconstrained images. CAL is therefore a detection data augmentation strategy focused on logo context optimisation. Conceptually, it is fully complementary to the other data augmentation methods widely adopted for training classification and detection deep learning models, e.g. flipping, rotation, random noise and scaling [Ren et al.(2015)Ren, He, Girshick, and Sun, Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Redmon and Farhadi(2017), Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, and Reed]. In this study, we adopt SCL [Su et al.(2017b)Su, Zhu, and Gong] to generate the initial synthetic data, i.e. superimposing the logo design images, with spatial and colour transformations, onto any given natural scene images (Fig. 3(a)).
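As a toy illustration of SCL-style synthesis described above, the following pastes a colour-jittered logo patch into a background image at a random location and returns the box label. Real SCL also applies geometric and illumination transforms and random scaling; the function name and the simple brightness jitter here are assumptions for illustration.

```python
# Toy sketch of SCL-style logo superimposition: place a (colour-jittered) logo
# patch at a random position inside a background image, returning the synthetic
# image together with its bounding box annotation.
import random
import numpy as np

def overlay_logo(background, logo, seed=0):
    rng = random.Random(seed)
    bg = background.copy()
    h, w = logo.shape[:2]
    H, W = bg.shape[:2]
    # random top-left position that keeps the logo fully inside the image
    y = rng.randrange(0, H - h + 1)
    x = rng.randrange(0, W - w + 1)
    jitter = 0.8 + 0.4 * rng.random()            # simple colour (brightness) transform
    bg[y:y + h, x:x + w] = np.clip(logo * jitter, 0, 255).astype(bg.dtype)
    return bg, (x, y, x + w, y + h)              # synthetic image + box label
```

Because the paste location ignores the surrounding scene, the result exhibits exactly the object-context inconsistency that CAL is designed to repair.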
Model Formulation We consider CAL as an image-to-image “translation” problem, i.e. translating one representation of a logo scene image into another of the same content at the pixel level. We particularly focus on rendering the logo objects to be more consistent with the scene context. Deep neural networks have been verified as strong models capable of learning to minimise a given objective loss function [Goodfellow et al.(2016)Goodfellow, Bengio, Courville, and Bengio, LeCun et al.(2015)LeCun, Bengio, and Hinton]. A straightforward solution would be a common convolutional neural network (CNN) supervised to minimise the Euclidean distance between the predicted and ground-truth pixel values. However, such modelling tends to produce blurry results, because the objective loss is minimised by averaging over all plausible outputs [Pathak et al.(2016)Pathak, Krahenbuhl, Donahue, Darrell, and Efros, Isola et al.(2017)Isola, Zhu, Zhou, and Efros]. How to generate realistic images, the core of CAL, remains a generally unsolved problem for CNNs.
Interestingly, this task is exactly the design purpose of the recently proposed Generative Adversarial Networks (GANs) [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio, Radford et al.(2015)Radford, Metz, and Chintala, Denton et al.(2015)Denton, Chintala, Fergus, et al., Salimans et al.(2016)Salimans, Goodfellow, Zaremba, Cheung, Radford, and Chen] – making the generated images indistinguishable from real ones. Unlike the manually designed loss functions of CNNs, a GAN automatically learns a loss that tries to classify whether an output image is real (e.g. context consistent) or fake (e.g. context inconsistent), whilst simultaneously training a generative model to minimise this loss. Blurry logo images are clearly fake and therefore well suppressed. Given the dependence on the initial synthetic image in our context, we explore the image-conditioned GAN, which learns a conditional generative model $G$ mapping an input synthetic image $x$ and a random noise vector $z$ to an output image $y$ [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio, Isola et al.(2017)Isola, Zhu, Zhou, and Efros]. Formally, the objective value function of a conditional GAN can be written as:

$$\mathcal{L}_{\mathrm{cGAN}}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big] \quad (1)$$
where the generator $G$ tries to minimise this objective value against an adversarial discriminator $D$, which instead tries to maximise it. Inspired by the modelling benefits of combining the GAN objective with a pixel distance loss [Isola et al.(2017)Isola, Zhu, Zhou, and Efros, Pathak et al.(2016)Pathak, Krahenbuhl, Donahue, Darrell, and Efros], we enhance the conditional adversarial loss (Eq. (1)) with an L1 loss to further suppress blurring:

$$\mathcal{L}(G, D) = \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda\, \mathbb{E}_{x,y,z}\big[\|y - G(x, z)\|_1\big] \quad (2)$$
where $\lambda$ controls the weight of the pixel matching loss; we set $\lambda$ empirically in our experiments. As such, the generator learning is also tied to the task of staying close to the ground-truth output, in addition to fooling the discriminator, whilst the discriminator learning remains unchanged. The optimal solution is:

$$G^* = \arg\min_G \max_D \mathcal{L}(G, D) \quad (3)$$
The noise input $z$ aims to learn a mapping from a distribution (e.g. Gaussian [Wang and Gupta(2016)]) to the target domain $y$. However, this strategy is often ineffective at capturing the stochasticity of conditional distributions, with the noise largely neglected [Isola et al.(2017)Isola, Zhu, Zhou, and Efros, Mathieu et al.(2015)Mathieu, Couprie, and LeCun]. Fortunately, it is not strictly necessary to fully model this distribution mapping in our problem, given the potentially infinite supply of synthetic images from SCL. More specifically, the variation of $y$ can be more easily captured by sampling the input synthetic logo images than by modelling the entropy of the conditional distribution through learning a mapping from one distribution to another. Without $z$, the model still learns a mapping from $x$ to $y$, albeit in a deterministic manner.
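For concreteness, the combined adversarial-plus-L1 objective can be evaluated on toy values in plain Python. This only illustrates the loss arithmetic: the discriminator outputs are treated as given probabilities rather than computed by a network, and the λ default of 100 here is an illustrative assumption, not the paper's setting.

```python
# Plain-Python illustration of the combined objective: the conditional
# adversarial term plus the weighted L1 pixel term. d_real and d_fake stand in
# for D(x, y) and D(x, G(x, z)); y and g_out are flattened pixel lists.
import math

def cgan_l1_objective(d_real, d_fake, y, g_out, lam=100.0):
    adv = math.log(d_real) + math.log(1.0 - d_fake)          # adversarial term
    l1 = sum(abs(a - b) for a, b in zip(y, g_out)) / len(y)  # L1 pixel distance
    return adv + lam * l1
```

With a fixed discriminator, a generator output closer to the ground truth strictly lowers the value, which is exactly the extra pull towards the target that the L1 term provides.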
Network Architecture We adopt the same generator and discriminator architectures as [Isola et al.(2017)Isola, Zhu, Zhou, and Efros]. Specifically, the generator is an encoder-decoder network in a U-Net architecture, with the encoder an 8-layer CNN and the decoder a mirrored structure. The discriminator is a 4-layer CNN. All convolution layers use 4×4 filters with stride 2.
Training Images To train the CAL model, we need a set of training image pairs. So that the model generalises to rendering SCL-synthesised images [Su et al.(2017b)Su, Zhu, and Gong] at test time, we also apply SCL to automatically build the training data. Specifically, given any natural image $y$, we select a region (either a random rectangle or an object foreground) and render it by SCL transformations, including image sharpening, median filtering, random colour changes and colour reduction. This results in a CAL training image pair $(x, y)$, where $x$ is the image with the context-inconsistent object region. We select two image sources for enhancing context richness: (1) non-logo background images from FlickrLogos-32 [Romberg et al.(2011)Romberg, Pueyo, Lienhart, and Van Zwol], on which we use random rectangle regions to generate training pairs; and (2) MS COCO images [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick], on which we utilise the object masks to make training pairs. Importantly, this method requires no additional labelling to create training data. A number of examples are given in Fig. 2.
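The pair construction above can be sketched minimally as follows: corrupt a random rectangle of a natural image to obtain the context-inconsistent input, keeping the original as the target. Only a colour-reduction transform is shown; the real pipeline also uses sharpening, median filtering and random colour changes, and the function name is an assumption.

```python
# Minimal sketch of CAL training pair construction: given a natural image y,
# apply a colour-reduction (posterise) transform to a random rectangle to get
# the context-inconsistent input x, plus the region mask later fed to the model.
import random
import numpy as np

def make_pair(y, seed=0, levels=4):
    rng = random.Random(seed)
    H, W = y.shape[:2]
    h, w = rng.randrange(H // 4, H // 2), rng.randrange(W // 4, W // 2)
    top, left = rng.randrange(0, H - h), rng.randrange(0, W - w)
    x = y.copy()
    step = 256 // levels
    region = x[top:top + h, left:left + w]
    x[top:top + h, left:left + w] = (region // step) * step   # colour reduction
    mask = np.zeros((H, W), dtype=np.uint8)
    mask[top:top + h, left:left + w] = 1
    return x, y, mask
```

No annotation is needed at any point: the original image is its own ground truth, which is what makes the data generation free of labelling cost.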
Model Optimisation and Deployment Given the training data, we adopt the standard optimisation approach of [Isola et al.(2017)Isola, Zhu, Zhou, and Efros] to train the CAL: we alternate between optimising the discriminator $D$ and the generator $G$ in each mini-batch SGD step. We train $G$ to maximise $\log D(x, G(x, z))$ rather than minimise $\log(1 - D(x, G(x, z)))$, as suggested in [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio]. We slow down the learning of $D$ by halving its gradient. To focus the CAL model on the context-inconsistent region, we additionally input the region mask alongside $x$ as an extra channel, leading to a 4-channel input. Once trained, CAL can be used to perform context rendering on the SCL-synthesised images. Example CAL-synthesised images are shown in Fig. 3(b).
4.2 Multi-Class Logo Detection Model Training
Given both the SCL and CAL synthesised images and the real images (supervised classes only), we train a pre-selected deep learning detection model [Ren et al.(2015)Ren, He, Girshick, and Sun, Redmon and Farhadi(2017)] on the mixture of all these data. Training first on synthetic data and then fine-tuning on real images would bias the detection model towards the supervised logo classes whilst significantly hurting performance on the unsupervised classes, as we show in the experiments (Table 4).
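The mixture strategy above can be sketched as a batch sampler that draws from the synthetic and real pools in a fixed ratio at every step, instead of consuming them sequentially. The ratio, batch size and function name are illustrative assumptions, not reported settings.

```python
# Sketch of mixed mini-batch sampling: every batch combines synthetic and real
# samples in a fixed (assumed) ratio, so the model never enters a real-only
# fine-tuning phase that would bias it towards the supervised classes.
import random

def mixed_batches(synthetic, real, batch_size=8, real_frac=0.25, steps=4, seed=0):
    rng = random.Random(seed)
    n_real = int(batch_size * real_frac)
    for _ in range(steps):
        batch = rng.sample(synthetic, batch_size - n_real) + rng.sample(real, n_real)
        rng.shuffle(batch)
        yield batch
```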
5 Experiments
Competitors We compared our CAL approach with two state-of-the-art approaches that allow for open logo detection: (1) SCL [Su et al.(2017b)Su, Zhu, and Gong]: a state-of-the-art method for generating synthetic detection training data with random logo design transformations and unconstrained background context. This enables the exploitation of the same state-of-the-art detection models as CAL. We selected two strong deep learning detectors: Faster R-CNN [Ren et al.(2015)Ren, He, Girshick, and Sun] and YOLOv2 [Redmon and Farhadi(2017)]. (2) YOLO9000 [Redmon and Farhadi(2017)]: a state-of-the-art deep learning detection model based on the YOLO architecture [Redmon et al.(2016)Redmon, Divvala, Girshick, and Farhadi], designed specifically to scale the detector to a large number of classes without exhaustive object instance box labelling. The key idea is to jointly learn the model from both the bounding-box-labelled training data of supervised classes and the image-level classification training data of all classes, using mixed mini-batches. We adopt the softmax cross-entropy loss for classification rather than the hierarchy-aware loss used in [Redmon and Farhadi(2017)], since there is no semantic hierarchy over logo classes. To improve the model's classification robustness against uncontrolled context, we further performed context augmentation on the logo design images (Fig. 1(a)) using the SCL method [Su et al.(2017b)Su, Zhu, and Gong].
Performance Metric For logo detection performance, we adopt the standard Average Precision (AP) for each individual logo class, and the mean Average Precision (mAP) over all classes [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman]. A detection is considered correct when the Intersection over Union (IoU) between the predicted and ground-truth boxes exceeds 0.5.
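The IoU criterion can be written out directly for axis-aligned boxes in `(x1, y1, x2, y2)` form; this is the standard computation, shown here for reference.

```python
# Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2).

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # zero when boxes are disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```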
Implementation Details For CAL model optimisation, we adopted the Adam solver [Kingma and Ba(2014)] with a learning rate of 0.0002 and momentum parameters β1 = 0.5 and β2 = 0.999. For both SCL [Su et al.(2017b)Su, Zhu, and Gong] and CAL, we generated 100 synthetic images per logo class (35,200 in total).
Table 3: Comparative evaluation (mAP in %) on the QMUL-OpenLogo benchmark.

| Method | All Class | Uns Class | Sup Class | Big Logo | Small Logo |
| --- | --- | --- | --- | --- | --- |
| YOLO9000 [Redmon and Farhadi(2017)] | 4.19 | 1.98 | 26.33 | 6.23 | 1.15 |
| YOLOv2 [Redmon and Farhadi(2017)] + SCL [Su et al.(2017b)Su, Zhu, and Gong] | 12.10 | 8.75 | 45.58 | 17.66 | 5.92 |
| YOLOv2 [Redmon and Farhadi(2017)] + CAL | 13.14 | 9.55 | 49.17 | 18.25 | 6.29 |
| Abs/Rel Gain (%) | 1.04/8.60 | 0.80/9.14 | 3.59/7.88 | 0.59/3.34 | 0.37/6.25 |
| FR-CNN [Ren et al.(2015)Ren, He, Girshick, and Sun] + SCL [Su et al.(2017b)Su, Zhu, and Gong] | 12.35 | 8.51 | 50.74 | 16.94 | 7.87 |
| FR-CNN [Ren et al.(2015)Ren, He, Girshick, and Sun] + CAL | 13.13 | 9.34 | 51.03 | 17.68 | 8.69 |
| Abs/Rel Gain (%) | 0.78/6.32 | 0.83/9.75 | 0.29/0.57 | 0.74/4.37 | 0.82/10.41 |
Comparative Evaluations The comparative results on the QMUL-OpenLogo benchmark are shown in Table 3. To examine performance in detail, we further report results separately on unsupervised and supervised logo classes, as well as on big (46.7%) and small (53.3%) logo instances split at a threshold of 0.02 scale ratio. We make the following observations: (1) All methods produce rather poor results (at most 13.14% mAP) on the QMUL-OpenLogo benchmark, suggesting that the scalability of current solutions remains insufficient for open logo detection deployments and warrants further investigation. (2) YOLO9000 [Redmon and Farhadi(2017)] yields the weakest performance of all methods. This suggests that joint learning of object classification and detection in a single loss formulation is ineffective for this challenging problem, particularly when the classification training data (clean logo design images) are limited in appearance variation. We also find that such modelling can negatively affect performance on the supervised classes, despite their fine-grained labelled training data. (3) With CAL, YOLOv2 achieves the best logo detection performance. Whilst the absolute gain of CAL over the state-of-the-art data synthesising method SCL is small, larger relative gains are achieved with both YOLOv2 and Faster R-CNN. (4) The accuracy on supervised logo classes is much higher than on unsupervised ones. This indicates the strong reliance of existing state-of-the-art detection models on manually labelled training data, and the unsolved challenge of learning from auto-generated synthetic images. (5) Small logos enjoy the largest relative gain from CAL. This is reasonable because, given the limited appearance detail of small instances, external contextual information becomes more important for accurate localisation and recognition.
To verify that the weak performance of state-of-the-art methods on the new QMUL-OpenLogo challenge reflects the difficulty of the benchmark rather than a weak implementation, we evaluated Faster R-CNN+CAL on the most popular benchmark FlickrLogos-32 [Romberg et al.(2011)Romberg, Pueyo, Lienhart, and Van Zwol]. We obtained 74.9% mAP, which closely matches the 73.3% mAP of [Iandola et al.(2015)Iandola, Shen, Gao, and Keutzer], which similarly uses a deep CNN model.
Qualitative Examination Fig. 4 shows four test examples for SCL and CAL based on Faster R-CNN. For the big “Danone” logo instance against a clean background, both models succeed. For the moderately sized “Chiquita” with viewpoint distortion and the small “Fiat” with subtle appearance, the SCL model fails whilst CAL remains successful. For the small “Kellogs” instance against complex background clutter, both models fail.
Further Analysis In Table 4, we evaluate the performance of Faster R-CNN (1) trained on the mixture of synthetic and real data, or (2) trained first on synthetic data and then fine-tuned on real data. The former clearly produces better performance, except on the supervised classes. This is expected, since the latter biases the model towards the supervised logo classes in the fine-tuning stage whilst largely degrading its generalisation to the unsupervised classes.
Fully Supervised Learning Evaluation To evaluate the QMUL-OpenLogo dataset more extensively, we further benchmarked a fully supervised learning setting where every logo class has real training data. In particular, for each logo class, we made a random 60%/10%/30% train/val/test image split. The data statistics are detailed in Table 5. With Faster R-CNN, we obtained an mAP of 48.3%, much higher than the best corresponding open logo detection result of 13.1% (Table 3).
In this work, we presented a new benchmark called QMUL-OpenLogo, enabling a faithful performance test of logo detection algorithms in more realistic and challenging deployment scenarios. In contrast to existing closed benchmarks, QMUL-OpenLogo considers an open-ended logo detection scenario where most classes are unsupervised – a simulation of incrementally arriving new logo classes without exhaustively labelled training data at the fine-grained bounding box level during deployment. This benchmark therefore uniquely provides a more realistic evaluation of logo detection algorithms for scalable and dynamic deployment under a limited labelling budget. We further introduced a Context Adversarial Learning (CAL) approach to synthetic training data generation, enabling the learning optimisation of state-of-the-art supervised object detection models even for unsupervised logo classes. Empirical evaluations show the performance advantages of our CAL method over state-of-the-art alternative detection and data synthesising methods on the newly introduced QMUL-OpenLogo benchmark. We also provided detailed model performance analyses on different types of test data, giving insights into the specific challenges of the proposed, more realistic open logo detection.
This work was partially supported by the China Scholarship Council, Vision Semantics Ltd, Royal Society Newton Advanced Fellowship Programme (NA150459), and Innovate UK Industrial Challenge Project on Developing and Commercialising Intelligent Video Analytics Solutions for Public Safety (98111-571149).
- [Bianco et al.(2017)Bianco, Buzzelli, Mazzini, and Schettini] Simone Bianco, Marco Buzzelli, Davide Mazzini, and Raimondo Schettini. Deep learning for logo recognition. Neurocomputing, 245:23–30, 2017.
- [Boia et al.(2014)Boia, Bandrabur, and Florea] Raluca Boia, Alessandra Bandrabur, and Catalin Florea. Local description using multi-scale complete rank transform for improved logo recognition. In IEEE International Conference on Communications, pages 1–4, 2014.
- [Denton et al.(2015)Denton, Chintala, Fergus, et al.] Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494, 2015.
- [Everingham et al.(2010)Everingham, Van Gool, Williams, Winn, and Zisserman] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
- [Georgakis et al.(2017)Georgakis, Mousavian, Berg, and Kosecka] Georgios Georgakis, Arsalan Mousavian, Alexander C Berg, and Jana Kosecka. Synthesizing training data for object detection in indoor scenes. arXiv preprint arXiv:1702.07836, 2017.
- [Girshick(2015)] Ross Girshick. Fast r-cnn. In IEEE International Conference on Computer Vision, 2015.
- [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- [Goodfellow et al.(2016)Goodfellow, Bengio, Courville, and Bengio] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.
- [Gupta et al.(2016)Gupta, Vedaldi, and Zisserman] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic data for text localisation in natural images. arXiv e-prints, 2016.
- [Hoi et al.(2015)Hoi, Wu, Liu, Wu, Wang, Xue, and Wu] Steven CH Hoi, Xiongwei Wu, Hantang Liu, Yue Wu, Huiqiong Wang, Hui Xue, and Qiang Wu. Logo-net: Large-scale deep logo detection and brand recognition with deep region-based convolutional networks. arXiv preprint arXiv:1511.02462, 2015.
- [Iandola et al.(2015)Iandola, Shen, Gao, and Keutzer] Forrest N Iandola, Anting Shen, Peter Gao, and Kurt Keutzer. Deeplogo: Hitting logo recognition with the deep neural network hammer. arXiv, 2015.
- [Isola et al.(2017)Isola, Zhu, Zhou, and Efros] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [Joly and Buisson(2009)] Alexis Joly and Olivier Buisson. Logo retrieval with a contrario visual query expansion. In ACM International Conference on Multimedia, pages 581–584, 2009.
- [Kalantidis et al.(2011)Kalantidis, Pueyo, Trevisiol, van Zwol, and Avrithis] Yannis Kalantidis, Lluis Garcia Pueyo, Michele Trevisiol, Roelof van Zwol, and Yannis Avrithis. Scalable triangulation-based logo recognition. In ACM International Conference on Multimedia Retrieval, page 20, 2011.
- [Kingma and Ba(2014)] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv, 2014.
- [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- [LeCun et al.(2015)LeCun, Bengio, and Hinton] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
- [Li et al.(2014)Li, Chen, Su, Duh, Zhang, and Li] Kuo-Wei Li, Shu-Yuan Chen, Songzhi Su, Der-Jyh Duh, Hongbo Zhang, and Shaozi Li. Logo detection with extendibility and discrimination. Multimedia tools and applications, 72(2):1285–1310, 2014.
- [Liao et al.(2017)Liao, Lu, Zhang, Wang, and Tang] Yuan Liao, Xiaoqing Lu, Chengcui Zhang, Yongtao Wang, and Zhi Tang. Mutual enhancement for detection of multiple logos in sports videos. In IEEE International Conference on Computer Vision, 2017.
- [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision. 2014.
- [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, and Reed] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott Reed. Ssd: Single shot multibox detector. In European Conference on Computer Vision, 2016.
- [Mathieu et al.(2015)Mathieu, Couprie, and LeCun] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
- [Pan et al.(2013)Pan, Yan, Xu, Sun, Shao, and Wu] Chun Pan, Zhiguo Yan, Xiaoming Xu, Mingxia Sun, Jie Shao, and Di Wu. Vehicle logo recognition based on deep learning architecture in video surveillance for intelligent traffic system. In IET International Conference on Smart and Sustainable City, pages 123–126, 2013.
- [Pathak et al.(2016)Pathak, Krahenbuhl, Donahue, Darrell, and Efros] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
- [Peng et al.(2015)Peng, Sun, Ali, and Saenko] Xingchao Peng, Baochen Sun, Karim Ali, and Kate Saenko. Learning deep object detectors from 3d models. In IEEE International Conference on Computer Vision, pages 1278–1286, 2015.
- [Radford et al.(2015)Radford, Metz, and Chintala] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- [Redmon and Farhadi(2017)] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [Redmon et al.(2016)Redmon, Divvala, Girshick, and Farhadi] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- [Ren et al.(2015)Ren, He, Girshick, and Sun] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- [Revaud et al.(2012)Revaud, Douze, and Schmid] Jerome Revaud, Matthijs Douze, and Cordelia Schmid. Correlation-based burstiness for logo retrieval. In ACM International Conference on Multimedia, pages 965–968, 2012.
- [Romberg and Lienhart(2013)] Stefan Romberg and Rainer Lienhart. Bundle min-hashing for logo recognition. In ACM International Conference on Multimedia Retrieval, pages 113–120, 2013.
- [Romberg et al.(2011)Romberg, Pueyo, Lienhart, and Van Zwol] Stefan Romberg, Lluis Garcia Pueyo, Rainer Lienhart, and Roelof Van Zwol. Scalable logo recognition in real-world images. In ACM International Conference on Multimedia Retrieval, page 25, 2011.
- [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, et al.] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
- [Salimans et al.(2016)Salimans, Goodfellow, Zaremba, Cheung, Radford, and Chen] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
- [Su et al.(2017a)Su, Gong, and Zhu] Hang Su, Shaogang Gong, and Xiatian Zhu. Weblogo-2m: Scalable logo detection by deep learning from the web. In Workshop of the IEEE International Conference on Computer Vision, 2017a.
- [Su et al.(2017b)Su, Zhu, and Gong] Hang Su, Xiatian Zhu, and Shaogang Gong. Deep learning logo detection with data expansion by synthesising context. IEEE Winter Conference on Applications of Computer Vision, 2017b.
- [Su et al.(2018)Su, Gong, and Zhu] Hang Su, Shaogang Gong, and Xiatian Zhu. Scalable deep learning logo detection. arXiv preprint arXiv:1803.11417, 2018.
- [Su et al.(2012)Su, Deng, and Fei-Fei] Hao Su, Jia Deng, and Li Fei-Fei. Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, volume 1, 2012.
- [Su et al.(2015)Su, Qi, Li, and Guibas] Hao Su, Charles R Qi, Yangyan Li, and Leonidas J Guibas. Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views. In IEEE International Conference on Computer Vision, 2015.
- [Tüzkö et al.(2017)Tüzkö, Herrmann, Manger, and Beyerer] Andras Tüzkö, Christian Herrmann, Daniel Manger, and Jürgen Beyerer. Open set logo detection and retrieval. arXiv preprint arXiv:1710.10891, 2017.
- [Wang and Gupta(2016)] Xiaolong Wang and Abhinav Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pages 318–335, 2016.