By nature, humans are visual learners. A single snapshot of a product conveys more information about the product than a wall of text, and few customers want to read large amounts of text when shopping online. According to research from the Nielsen Norman Group (Nielsen, 1997), only 16% of readers actually read word-for-word, while 79% merely scan the highlights. In e-commerce, captivating images can keep customers on the product page long enough for them to read the detailed product pitch. Product images are important for a number of reasons:
They build customers’ confidence in the product quality
They help customers better understand the product before making a purchase decision
Despite the well-known importance of images, e-commerce retailers who offer marketplace platforms often struggle to control image quality. These e-commerce websites usually display content that was created both in-house and by third-party content creators (i.e., content service providers, marketplace sellers, suppliers). All external and internal content creators are expected to meet the retailer's Trust & Safety guidelines. These guidelines are constantly changing and expanding, which makes it incredibly difficult for e-commerce retailers to ensure that external content providers are complying with the Trust & Safety guidelines.
In this paper, we address the following types of compliance guideline violations:
Offensive Images : Figure 1 shows various types of offensive images. The examples include images that have nudity, sexually explicit content, abusive text, objects used to promote violence, and racially inappropriate content. It is common that an image is marked as offensive whereas the product that it belongs to is not (e.g., a CD with nudity on its album cover).
Non-compliant Images : Most e-commerce retailers have published compliance guidelines on what can be sold on their platform. Figure 1 [iii] and [iv] show non-compliant products, such as an assault rifle and a toy that resembles an assault-style rifle. Images of such products are considered non-compliant.
Logos and Badges : A wide range of logos and banners are considered non-compliant. In Figure 2, the image located second from the left uses a self-proclaimed marketing logo to lure the customer into clicking on it. Using self-proclaimed marketing badges to increase the likelihood of customers clicking on a product in the search results or on the browse page is a common malpractice. Other non-compliant logos include competitors' logos, logos claiming top-rank or best-selling status, inaccurate manufacturing country logos (e.g., a Made in USA logo), and many others (as seen in Figure 3).
Traditionally, the following two non-technical solutions have been tried by the retailers:
Put a disclaimer on the website that the displayed content is not owned by the retailer
Allow the customers to report non-compliant content so that they can be filtered using a human workforce
Clearly, these options do not protect the customer from having an unpleasant experience. The disclaimer, which often goes unnoticed, fails to protect the retailer’s brand value.
In this paper, we discuss a machine learning based system that helps a retailer enforce its compliance terms & conditions pertaining to product content. The concepts learnt while developing this system can be leveraged by any service that serves images or other visual content to human users on the web, such as social media feeds, ads platforms, etc.
2 Technical Challenges
As discussed in Section 1, it is crucial for any e-commerce company, or indeed any platform hosting visual content, to have a system that efficiently and proactively detects and removes non-compliant images. Rule-based systems, where the rules are based on NLP techniques such as TF-IDF, are often proposed as the baseline solution. In e-commerce, such a system relies on the product title and description. However, in many cases the product itself is compliant, but the image is not (e.g., a music CD with a nude photo on the CD). A textual rule-based system would fail to capture such an example. Use of optical character recognition (OCR) to extract non-compliant text from images is another conventional solution. However, OCR is accurate only if the image meets certain conditions, and it cannot capture a wide range of problems, such as nudity or an assault rifle, where there is no text on the image. These limitations motivated us to build a system that understands and analyzes the product images directly.
Building such a system involves several challenges though:
Scale and Variation in Catalog : E-commerce catalogs of large retailers have hundreds of millions of products across tens of thousands of product categories. Additionally, non-compliant content of any given type can appear across several, if not all, product categories. For example, a "best-price" logo can appear on images of products from any category. Moreover, the catalog keeps changing. Creating a large enough training set that is a true representative of the catalog is difficult and expensive.
Variety of Defining Examples : The second challenge, as shown in Figure 4, is that a single non-compliant type can appear in multiple forms (e.g., a best seller badge). Thus, the traditional methods, such as template matching, are not effective enough. We need to ensure that the system uses models that are generalized enough to capture various forms of infractions for a single use case.
Custom and Fine-Grained Class Definitions : The non-compliance guidelines often require us to capture only a certain form or variation of an object. For example, most e-commerce websites allow hunting rifles but not assault rifles to be sold. From a machine learning point of view, differentiating between assault rifle and hunting rifle images falls into the category of fine-grained classification, which is challenging. Similarly, the image of a person wearing swimwear is acceptable, but a picture of a nude person, which is visually close to the former, is not. This also means standard object detectors that detect guns or people would not suffice to solve our problem. An even more difficult manifestation of the problem is the case where certain images (such as swimwear images) are deemed offensive because of the pose or expression of the person, while other images featuring the same person are considered acceptable. We have yet to develop an automated solution for this hardest variation of the problem.
Lack of Usable Training Data : Most of these non-compliant images are hard to find manually. In most cases, even if we collect similar images from various sources on the internet, it may be difficult to find more than a few labeled data points. One option is to manually tag images, but that is expensive for the size of our catalog. This is problematic because we need to build a model that works on hundreds of millions of products.
This paper discusses how we built an effective system despite these challenges. Section 3 focuses on how we handle the class imbalance problem in training data (challenge 4). Section 4 focuses on modeling techniques to address challenges 2 and 3. Section 5 outlines the system that addresses the scale issue (challenge 1).
3 Approaches to Building Training Data
Data augmentation is a common approach to generate additional training data. We used controlled transformation on one training image to create several similar valid training images. Some of the transformations we used included translation, flip, rotation, color/contrast adjustment and adding noise. These are, however, not sufficient, since we often start with a minuscule number of images. The following subsections describe additional novel techniques we used to generate usable training data.
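The transformations above can be sketched with plain NumPy array operations; this is a minimal illustration (production pipelines typically use a library such as torchvision or imgaug, and the function name is ours):

```python
import numpy as np

def augment(image, rng):
    """Yield simple label-preserving variants of an H x W x C uint8 image."""
    yield np.flip(image, axis=1)                        # horizontal flip
    yield np.rot90(image, k=1, axes=(0, 1))             # 90-degree rotation
    yield np.roll(image, shift=5, axis=1)               # small translation
    brighter = np.clip(image.astype(np.float32) * 1.2, 0, 255)
    yield brighter.astype(image.dtype)                  # contrast adjustment
    noisy = image.astype(np.float32) + rng.normal(0, 8, image.shape)
    yield np.clip(noisy, 0, 255).astype(image.dtype)    # additive Gaussian noise

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
variants = list(augment(img, rng))  # one input image -> five training images
```

Each variant keeps the original label, which is why these transformations are safe to apply blindly; the rotation is restricted to 90 degrees here only so the array shape stays square.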
3.1 Use of Visual Search in Data Generation
We have re-purposed an in-house visual search model to find images similar to a sample non-compliant image. The visual search model is an Inception-v3 based deep learning model trained on all of catalog images for the purpose of product categorization. The embeddings from this model are re-used for various classification and retrieval tasks because they are generic representations of the deep latent factors of the image. As shown in Figure 5, for every non-compliant image, we compute its embedding and then use the approximate nearest neighbor method to find images with similar embeddings. Some of these retrieved images are added to the training set after manual verification. Similar yet non-offensive images are added as valuable negative examples.
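The retrieval step can be sketched as follows. A brute-force cosine search stands in for the approximate-nearest-neighbor index used at catalog scale, and the embeddings are random stand-ins for the Inception-v3 features:

```python
import numpy as np

def find_similar(query_emb, catalog_embs, k=5):
    """Return indices of the k catalog embeddings most similar to the query
    by cosine similarity. At catalog scale this would be an ANN index."""
    q = query_emb / np.linalg.norm(query_emb)
    c = catalog_embs / np.linalg.norm(catalog_embs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

rng = np.random.default_rng(1)
catalog = rng.normal(size=(1000, 128))           # stand-in embedding matrix
query = catalog[42] + rng.normal(0, 0.01, 128)   # near-duplicate of item 42
hits = find_similar(query, catalog, k=5)         # item 42 ranks first
```

The retrieved candidates would then go through manual verification before being added as positive (or hard-negative) training examples, as described above.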
3.2 Synthetic Data Generation Using Superimposition
The visual search technique has been effective for certain use cases like nudity, assault rifles, etc. However, for use cases like logos/badges, it fails because the embedding captures the main object in an image, not the logos, which are only a very small portion of product images. So, we created synthetic training data using a novel approach as discussed below.
(a) Find as many different-looking logos as possible. Split these logos into two sets: Train and Test (Figure 6). This strategy helps us address the second challenge (Variety of Defining Examples) and provides us with a way to know if the model has generalized well for different variations of a single non-compliant use case.
(b) Process the logo images. This includes cropping the image to the extent where the logo is the central, tightly bounded object and the background is transparent (Figure 6).
(c) Superimpose these processed logos on many compliant images to make non-compliant training/test samples (Figure 7 - [ii] and [iii]).
(d) Apply random transformations on the logos before superimposition. These include random scaling, rotating, orienting, flipping, translating, mangling, and/or distorting the non-compliant content (Figure 7 - [ii] and [iii]).
(e) Also use similar-looking logos that are compliant, and add them to some training samples after random transformation (Figure 7 - [iv]).
We started with 100K compliant images sampled across the catalog, representing all product categories. We then applied the above-mentioned steps for each kind of logo/badge/racist symbol under consideration, giving us 100K positive samples for each use case. Steps (d) and (e) help the model generalize better and reduce false positives. Since the position of superimposition is known for every image, this process generates a large number of images with logos as well as accurate locations. Obtaining the locations at no cost is a big advantage of this synthetic data generation technique, as it dramatically reduces the cost of image annotation, especially when training object detection models.
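The superimposition step can be sketched with NumPy alpha compositing. This is a minimal version (the real pipeline also applies the random logo transformations described above); note how the bounding-box annotation falls out for free:

```python
import numpy as np

def superimpose(base, logo_rgba, rng):
    """Paste a transparent RGBA logo at a random position on an RGB image.
    Returns the composite and the (x1, y1, x2, y2) bounding box for free."""
    H, W, _ = base.shape
    h, w, _ = logo_rgba.shape
    y = int(rng.integers(0, H - h + 1))
    x = int(rng.integers(0, W - w + 1))
    out = base.astype(np.float32).copy()
    alpha = logo_rgba[..., 3:4].astype(np.float32) / 255.0
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = alpha * logo_rgba[..., :3] + (1 - alpha) * region
    return out.astype(base.dtype), (x, y, x + w, y + h)

rng = np.random.default_rng(2)
product = np.full((256, 256, 3), 200, dtype=np.uint8)  # compliant base image
logo = np.zeros((32, 48, 4), dtype=np.uint8)
logo[..., 0] = 255   # a fully opaque red badge
logo[..., 3] = 255
sample, box = superimpose(product, logo, rng)
```

Because the box is known exactly, each synthetic sample doubles as an object-detection annotation without any manual labelling.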
3.3 Weak Learners + Crowdsourcing to Augment Training Data
4 Approaches to Image Understanding
4.1 Approach 1: Deep Embedding + Shallow Classifiers
4.2 Approach 2: Fine-Tuning DeepNets
The second approach was to retrain the last few layers of pretrained deep Convolutional Neural Networks (CNNs) (Figure 10). We experimented with ResNet50 and Inception-V3, both pretrained on ImageNet. We added a fully connected layer with softmax at the end and performed experiments in which we retrained the last n residual or inception layers, with n = 1, 2, or all.
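The core idea of adding a fresh fully connected softmax head while keeping the backbone frozen can be sketched in NumPy; random features stand in for the frozen CNN activations (in practice this is done inside a deep learning framework):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_head(feats, labels, n_classes, lr=0.5, epochs=300, seed=0):
    """Gradient descent on a new FC + softmax layer only; the backbone
    that produced `feats` stays frozen, mirroring the fine-tuning setup."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.01, (feats.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        probs = softmax(feats @ W + b)
        grad = (probs - onehot) / len(labels)   # cross-entropy gradient
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Two separable clusters standing in for frozen backbone features.
rng = np.random.default_rng(3)
feats = np.vstack([rng.normal(-2, 1, (100, 16)), rng.normal(2, 1, (100, 16))])
labels = np.array([0] * 100 + [1] * 100)
W, b = train_head(feats, labels, n_classes=2)
acc = (softmax(feats @ W + b).argmax(axis=1) == labels).mean()
```

Unfreezing the last n backbone layers extends the same recipe: those layers' weights simply join the parameters updated by the gradient step.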
Approach 1 gave us good results for nudity and assault rifles but did not work for other use cases such as logos. Approach 2, fine-tuning deep nets, helped in those cases. However, with this methodology we reached a point where performance plateaued and further optimization became difficult.
4.3 Approach 3: Object Detection
Whenever the prediction from Approach 2 failed on a test sample, one of our first hypotheses was that the model was still not generalizing well to different variants of the logo. Debugging deeper, we gleaned some key insights, as shown in Figure 11. Images A and B are both non-compliant. While the model works perfectly fine on Image A, it fails to detect Image B. We created Image C by removing the logo from Image B and then introduced the logo from Image A onto Image C to get Image D, with the hope that this image would be flagged. However, the predicted probability for Image D remained low. Thus, it was challenging to get high recall on the minority class and high precision on the majority class by treating this as a binary classification problem. Technically, with a large enough dataset that closely approximates the entire catalog and with more positive training samples, we could have solved this problem. Instead, we chose to experiment with Approach 3: an object-detection-based system.
Since we used synthetic data generated using superimposition, we knew the bounding boxes of the non-compliant content. This allowed us to use Approach 3 without major manual adjustments. The other advantage of Approach 3 is that the output is not only a score but also a box, making it simple for a human to verify why a particular image was flagged as non-compliant. This also helps us understand the gaps in our training data and introduce samples that reduce false positives. When Approach 3 is used for inference, manual reviewers can quickly ascertain whether the image truly is non-compliant.
4.4 Iterative Training Framework
Irrespective of the approach, our iterative training framework allowed us to start from limited training data and gradually improve the model by adding examples found by the model back into the training data.
As Figure 12 shows, the first set of training images is collected from the internal product catalog, by web search, and by synthetic data generation. The first version of the model (classifier/object detector) is built using this dataset. Then, a moderate number of images are run through the model in the hope of detecting more positive examples. If the model finds such examples with high confidence, they are removed from the website and also added to the training data. If the model finds examples with low confidence, they are sent for manual review. In either case, each detected image is passed through a deep image similarity model (trained on our product images) to obtain a number of similar images, and the model predictions on these similar images are also sent for manual review. The images with their verified labels are added to the training set and the model is retrained. A number of iterations are usually necessary to obtain the desired accuracy.
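The routing step of this loop can be sketched as follows (the threshold values are illustrative, not the production settings):

```python
def triage(scores, auto_threshold=0.9, review_threshold=0.5):
    """Split model scores per the iterative framework: high-confidence hits
    are taken down and added straight to the training set, mid-confidence
    ones are queued for manual review, the rest are left alone this round."""
    auto_accept, manual_review = [], []
    for idx, score in enumerate(scores):
        if score >= auto_threshold:
            auto_accept.append(idx)
        elif score >= review_threshold:
            manual_review.append(idx)
    return auto_accept, manual_review

auto, review = triage([0.97, 0.2, 0.6, 0.92, 0.55])
# auto  -> indices [0, 3]  (removed and added to training data)
# review -> indices [2, 4] (sent to human reviewers)
```

Each iteration retrains on the newly verified labels, so the two buckets feed the same training set through different verification paths.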
5 System Overview
Our system consists of a number of classifiers and object detectors. The simplest solution to this problem could have been a single classifier/object detector that takes an image and classifies it into one of the non-compliant categories. In practice, building a single, highly accurate model for a large number of categories is challenging because a) it is difficult to get enough training data for most categories, and b) the variety of images across our catalog is incredibly high. We have addressed this issue by building several specific classifiers, each targeting a particular non-compliance category, wherever a single multi-class model did not achieve acceptable accuracy.
We have found that for non-compliant categories that are likely to appear in only a few product categories, we can build a specialized classifier with higher accuracy. For example, nudity is most likely to be found in images of people, paintings, sculptures, CDs, carpets, books, and posters; assault rifle images are most likely to be found in weapons, toys, and books. Referring to Figure 13, our first-level classifier (L1) classifies an input image into one of the major types, such as a person, book, or painting. Depending on the type of image, it is sent to one or more second-level detectors (L2) that are trained to catch a particular non-compliant category. For example, an image of a person would be passed through the nudity detector, an image of a toy would go through the weapon detector, and so on. If an image does not fall into any of the product types associated with non-compliant categories, it is classified as 'rest' and does not go through any L2 detectors. The main advantage of this two-stage approach is that only a small subset of catalog images needs to be passed through the L2 detectors. This strategy, in conjunction with Approach 3, helps us address the challenge of scale and variation in the catalog.
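The two-stage dispatch amounts to a routing table keyed by the L1 label; the mapping below is an illustrative stand-in, not the production configuration:

```python
# Hypothetical mapping from L1 product type to the L2 detectors it must pass.
L2_ROUTES = {
    "person":   ["nudity"],
    "painting": ["nudity"],
    "cd":       ["nudity"],
    "toy":      ["weapon"],
    "book":     ["nudity", "weapon"],
}

def dispatch(l1_label):
    """Return the L2 detectors for an image; 'rest' images skip L2 entirely."""
    return L2_ROUTES.get(l1_label, [])

book_route = dispatch("book")        # runs both nudity and weapon detectors
rest_route = dispatch("furniture")   # no route: treated as 'rest', no L2 pass
```

Because most catalog images fall into 'rest', the expensive L2 detectors only see the small slice of traffic where their category can plausibly occur.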
For the remaining non-compliance categories, a multi-class object detector based on Faster R-CNN works the best and is in production.
The Offensive/Non-Compliant Detector integrates with the rest of the components of the system as shown in Figure 14. Suppliers and marketplace sellers supply product data, which is then processed and put into the e-commerce catalog. All new assets are put on a Kafka queue, which is read by an Asset Classification Orchestrator. The orchestrator performs asset preprocessing, such as ensuring that the image has a valid size and format, and then calls all Offensive/Non-Compliant Detector services. Flagged products are stored in an RDBMS. Once images are flagged as non-compliant, they are automatically removed if the classifier's confidence is higher than a set threshold; images flagged with low confidence are pushed to a manual review pipeline. Human reviewers use a tool, powered by the RDBMS, to expedite the process. Each seller/supplier is shown their Trust & Safety infractions on their dashboard and can appeal the removal of their content. If a supplier/seller continues to publish non-compliant content despite multiple warnings, they are removed from the platform. These detectors could also be used in a Product Image Selection System as in (Chaudhuri et al., 2018).
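The orchestrator's per-asset flow can be sketched as a single function; the asset dict, the detector callables, and the threshold values are all illustrative stand-ins for the Kafka-fed production services:

```python
def orchestrate(asset, detectors, removal_threshold=0.95, review_threshold=0.5):
    """Validate an asset, run it through each detector service, then decide:
    auto-remove (high confidence), manual review (low confidence), or publish."""
    if asset["format"] not in {"JPEG", "PNG"} or min(asset["size"]) < 100:
        return ("rejected_preprocessing", None)   # invalid size or format
    for name, score_fn in detectors.items():
        score = score_fn(asset)
        if score >= removal_threshold:
            return ("auto_removed", name)         # high confidence: take down
        if score >= review_threshold:
            return ("manual_review", name)        # low confidence: human check
    return ("published", None)

detectors = {"logo": lambda a: 0.3, "nudity": lambda a: 0.97}
decision = orchestrate({"format": "JPEG", "size": (640, 480)}, detectors)
bad_asset = orchestrate({"format": "GIF", "size": (640, 480)}, detectors)
```

A real deployment would record the decision in the RDBMS and notify the seller's dashboard, but the branch structure is the same.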
6 Experimental Results
In this section, we present all of the experiments for two representative non-compliant categories: "Best Seller" logos and nudity. Experiments for other categories are similar in nature and have produced similar results and inferences.
Logo and Badge Detection: For logo and badge detection, neither traditional feature detection and template matching techniques nor Approach 1 (Section 4.1: linear classification of deep embeddings) is applicable, as explained earlier. For Approach 2 (Section 4.2: fine-tuned deep nets), we experimented with pretrained Inception-V3 by retraining the last one, two, and all inception layers. We also experimented with an in-house visual search model pretrained on the entire set of catalog images, retraining its last one and two layers; these results are labelled omni_1layer and omni_2layer in Figures 15 and 16. The training set contains 700K images and the test set 140K. As seen in Figure 15, results from Inception-V3 and the retrained visual search model are comparable. Based on the insight gained while debugging precision-recall issues, we also experimented with Approach 3 (Section 4.3: object detection). We chose Faster R-CNN since it is known for delivering high accuracy on small objects such as logos. Faster R-CNN is one of the slower models among the popular object detection networks, but since our queue-based distributed architecture allows a higher inference time for image analysis, we consciously chose accuracy over inference time. Figure 16 indicates that the F1-score of the Faster R-CNN model is 100% better than the Approach 2 experiments at a confidence score of 0.85.
Table 1: Relative performance of the candidate techniques against the third-party API baseline (marked with 'x').

| Technique | Precision (%) | Recall (%) | F1-score (%) |
| --- | --- | --- | --- |
| 3rd Party API | x | x | x |
| Deep Embedding + Shallow Classifier (approach 1) | +30 | +34.5 | +14 |
| Resnet50 (approach 2) | +32 | +40 | +35.5 |
| Inception-V3 (approach 2) | +51 | +49 | +49.5 |
| Object Detection (approach 3) | +55 | +67 | +54 |
Nudity Detection: Detecting nudity/sexually explicit content is a widespread problem, and there are many solutions available for purchase. Hence, we started with a commercially available third-party API that accepts an image and returns two scores, namely an Adult Score and a Racy Score, to quantify the offensiveness of each image. Images with the two aforementioned scores above a certain threshold were sent for manual review by crowd workers. Since the third-party API was trained on a different distribution of nude/sexual images than those published on our platform, the flagging logic described above returned a large number of false positives, leading to low precision and high recall. As Figure 17a suggests, the percentage of images accepted by the crowd (denoted by orange dots) is far lower than that rejected by the crowd (denoted by blue dots). The false positive rate varies across categories (Figures 17b and 17c), but it is on the higher side regardless.
A detailed one-month study of the data revealed a few key observations: (1) the presence of actual positive instances (nude images) was concentrated in a few small segments of the catalog, and (2) even within those categories, the distribution of positive and negative instances varied widely. Based on these observations, we fine-tuned the overall threshold and introduced category-specific thresholds. Even with all these changes, the best precision we could achieve was very low (below 25%).
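Category-specific thresholding amounts to a per-category lookup with a tuned global fallback; the threshold values below are illustrative, not the ones used in production:

```python
def is_flagged(category, adult_score, thresholds, default=0.8):
    """Flag an image using a per-category threshold when one exists,
    otherwise fall back to the tuned global default."""
    return adult_score >= thresholds.get(category, default)

# Hypothetical thresholds derived from the one-month review study.
thresholds = {"books": 0.6, "paintings": 0.9}
flags = [is_flagged("books", 0.7, thresholds),      # prevalent category: stricter
         is_flagged("paintings", 0.7, thresholds),  # tolerant category: looser
         is_flagged("toys", 0.7, thresholds)]       # no entry: global default
```

The same image score can thus be flagged in one category and passed in another, reflecting observation (2) that the positive/negative mix varies widely by category.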
The third-party API method did not perform well, but it helped us lay out strategies to create a near optimal training set for the deep learning approaches we ventured into next. Based on crowd responses for different categories, we ensured that our training set:
has enough representatives of problematic images from the categories in which the problem is prevalent (Figure 18 shows the modified, balanced distribution for books)
has enough representatives of both types of offensiveness (Adult and Racy Scores) within each category
With the carefully crafted training data, we experimented with the approaches shown in Table 1. All experiments were run on internal datasets. The goal of these experiments was to find a method that suits our use case and data, so absolute performance numbers are not presented; instead, in Table 1, we compare a number of candidate techniques against the baseline method (the third-party API), shown in bold with an 'x'. The results from Approach 1 (Section 4.1: deep-embedding-based linear model) and Approach 2 (Section 4.2: retrained deep neural nets) are much better than the baseline. Within Approach 2, fine-tuned Inception-V3 performs better than fine-tuned ResNet50. Since Approach 1 uses a model trained on e-commerce catalog images, as opposed to Approach 2, which uses a model pretrained on ImageNet, Approach 1 generalizes better to newly unseen images. Training the base model for Approach 1 is costlier, though. Approach 2, being based on a model trained on ImageNet, can be retrained faster and with less data. In our experience, the quality and quantity of the data have a greater impact than the modelling choice. Our final observation: Approach 3 (object detection) outperforms the rest. Manual reviewers annotated about 50K images with bounding boxes for the nudity use case, roughly 40% of which serve as positive training labels. With this limited number of images, we retrained YOLOv3 (pretrained on the COCO dataset) and tested it on about 8K test examples, achieving a 55% lift in precision over the baseline.
7 Related Work
The importance of images in e-commerce is well studied. Online shoppers often use images as the first level of information, and item popularity depends heavily on image quality (Zakrewsky et al., 2016). (Di et al., 2014) provides a deeper understanding of the roles images play in e-commerce and shows evidence that better images can increase buyers' attention, trust, and conversion rates.
Image classification models based on skin detection techniques (Arentz & Olstad, 2004) have been proposed for nudity detection. Skin regions are detected based on color, texture, contour, and shape information features. (Zheng et al., 2004) uses maximum entropy distribution to detect the skin regions in the image.
Traditionally, logo/badge recognition has been addressed by keypoint-based detectors and descriptors (Kalantidis et al., 2011; Romberg & Lienhart, 2013; Joly & Buisson, 2009), feature detection (using SIFT, SURF, BRIEF, ORB), and feature matching (using Brute-Force or FLANN matchers) (Mordvintsev & K, 2013a). Although we tried several of those options as well as classical template matching (Mordvintsev & K, 2013b), those approaches did not work well for a catalog that contains millions of products; they were unable to detect even a simple logo like the one shown in Figure 19. A few deep learning based logo detection models have been reported recently (Bianco et al., 2017; Su et al., 2017; Eggert et al., 2015; Iandola et al., 2015). All of these techniques are tested on publicly available brand-logo datasets such as BelgaLogos (Joly & Buisson, 2009) or FlickrLogos-32 (Romberg et al., 2011).
Recent advances in deep learning have brought neural nets to the forefront of image classification. A number of deep learning architectures such as Alexnet (Krizhevsky et al., 2012), VGG net (Simonyan & Zisserman, 2014), residual networks (He et al., 2016), Inception (Szegedy et al., 2014, 2015), and Nasnet (Zoph et al., 2017) have been proposed. In this paper, we experimented with Resnet and Inception architectures that were pretrained on the ImageNet (Deng et al., 2009) dataset and then retrained on our images.
Object detection deals with detecting instances of semantic objects of a certain class and identifying their locations. There are several well-known object detection methods, such as SSD (Liu et al., 2016), region-based object detection (Girshick et al., 2016), YOLO (Redmon et al., 2016), R-FCN (Dai et al., 2016), and Faster R-CNN (Ren et al., 2015).
Generating synthetic training data allows ground-truth annotations to be expanded plausibly without the need for exhaustive manual labelling. This strategy has been shown to be effective for training large CNN models, particularly when sufficient training data is not available (Dosovitskiy et al., 2015; Eggert et al., 2015).
8 Conclusion
In this paper, we presented a machine-learning driven system that detects and removes offensive and non-compliant images from an e-commerce catalog of hundreds of millions of items. In addition to describing the core modeling components of the system, we discussed the technical challenges of building a system at such a scale, namely the lack of adequate training data, an extreme class imbalance, and a changing test distribution. We also described the critical refinements made to the data and to the modeling techniques to effectively overcome these challenges. The presented system is already deployed in production and has processed millions of product images.
We continue to improve the system as the catalog and the problem definitions evolve. We plan to combine image and textual signals from products to build a more effective model. We are in the process of turning the system into an AutoML framework where new models can be trained quickly when new types of non-compliant items surface.
The strategies adopted and the insights gained can be leveraged by content-serving web-based platforms from other domains as well. Many social media and advertising platforms need to constantly weed out offensive material. At a high level, our work contributes to the broader goal of making the internet a safer place.
- Arentz & Olstad (2004) Arentz, W. A. and Olstad, B. Classifying offensive sites based on image content. Computer Vision and Image Understanding, 94(1-3):295–310, 2004.
- Bianco et al. (2017) Bianco, S., Buzzelli, M., Mazzini, D., and Schettini, R. Deep learning for logo recognition. Neurocomputing, 245:23–30, 2017.
- Chaudhuri et al. (2018) Chaudhuri, A., Messina, P., Kokkula, S., Subramanian, A., Krishnan, A., Gandhi, S., Magnani, A., and Kandaswamy, V. A smart system for selection of optimal product images in e-commerce. In 2018 IEEE International Conference on Big Data (Big Data), pp. 1728–1736. IEEE, 2018.
- Dai et al. (2016) Dai, J., Li, Y., He, K., and Sun, J. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387, 2016.
- Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- Di et al. (2014) Di, W., Sundaresan, N., Piramuthu, R., and Bhardwaj, A. Is a picture really worth a thousand words?: - on the role of images in e-commerce. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM ’14, pp. 633–642, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2351-2. doi: 10.1145/2556195.2556226.
- Dosovitskiy et al. (2015) Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., and Brox, T. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2758–2766, 2015.
- Eggert et al. (2015) Eggert, C., Winschel, A., and Lienhart, R. On the benefit of synthetic data for company logo detection. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 1283–1286. ACM, 2015.
- Girshick et al. (2016) Girshick, R., Donahue, J., Darrell, T., and Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE transactions on pattern analysis and machine intelligence, 38(1):142–158, 2016.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Iandola et al. (2015) Iandola, F. N., Shen, A., Gao, P., and Keutzer, K. Deeplogo: Hitting logo recognition with the deep neural network hammer. arXiv preprint arXiv:1510.02131, 2015.
- Joly & Buisson (2009) Joly, A. and Buisson, O. Logo retrieval with a contrario visual query expansion. In Proceedings of the 17th ACM International Conf. on Multimedia, pp. 581–584, 2009.
- Kalantidis et al. (2011) Kalantidis, Y., Pueyo, L. G., Trevisiol, M., van Zwol, R., and Avrithis, Y. Scalable triangulation-based logo recognition. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, pp. 20. ACM, 2011.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc., 2012.
- Liu et al. (2016) Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. Ssd: Single shot multibox detector. In European conference on computer vision, pp. 21–37. Springer, 2016.
- Mordvintsev & K (2013a) Mordvintsev, A. and K, A. Feature matching. https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_matcher/py_matcher.html, 2013a.
- Mordvintsev & K (2013b) Mordvintsev, A. and K, A. Template matching. https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_template_matching/py_template_matching.html, 2013b.
- Nielsen (1997) Nielsen, J. How users read on the web. https://www.nngroup.com/articles/how-users-read-on-the-web/, 1997.
- Redmon et al. (2016) Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788, 2016.
- Ren et al. (2015) Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99, 2015.
- Romberg & Lienhart (2013) Romberg, S. and Lienhart, R. Bundle min-hashing for logo recognition. In Proceedings of the 3rd ACM conference on International conference on multimedia retrieval, pp. 113–120. ACM, 2013.
- Romberg et al. (2011) Romberg, S., Pueyo, L. G., Lienhart, R., and Van Zwol, R. Scalable logo recognition in real-world images. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, pp. 25. ACM, 2011.
- Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Su et al. (2017) Su, H., Zhu, X., and Gong, S. Deep learning logo detection with data expansion by synthesising context. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 530–539. IEEE, 2017.
- Szegedy et al. (2014) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. E., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
- Szegedy et al. (2015) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
- Zakrewsky et al. (2016) Zakrewsky, S., Aryafar, K., and Shokoufandeh, A. Item popularity prediction in e-commerce using image quality feature vectors. CoRR, abs/1605.03663, 2016.
- Zheng et al. (2004) Zheng, H., Liu, H., and Daoudi, M. Blocking objectionable images: adult images and harmful symbols. In 2004 IEEE International Conference on Multimedia and Expo (ICME)(IEEE Cat. No. 04TH8763), volume 2, pp. 1223–1226. IEEE, 2004.
- Zoph et al. (2017) Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. CoRR, abs/1707.07012, 2017.