1 Introduction and related work
Designing a logo for a new brand is usually a lengthy and tedious process, both for the client and the designer. Many ultimately unused drafts are produced, from which the client selects their favorites, followed by multiple cycles of refining the logo to match the client's needs and wishes. Especially for clients without a specific idea of the end product, this results in a procedure that is not only time-consuming but also costly.
The goal of this work is to provide a framework towards a system with the ability to generate (virtually) infinitely many variations of logos (some examples are shown in Figure 1) to facilitate and expedite such a process. To this end, the prospective client should be able to modify a prototype logo according to specific parameters like shape and color, or shift it a certain amount towards the characteristics of another prototype. An example interface for such a system is presented in Figure 2. It could help both designer and client to get an idea of a potential logo, which the designer could then build upon, even if the system itself was not (yet) able to output production-quality designs.
Logo image data
Existing research literature has focused mostly on retrieval, detection, and recognition of a reduced number of logos [14, 17, 30, 32, 34, 42] and, consequently, a number of datasets have been introduced. The most representative large public logo datasets are shown in Table 1. Due to the low diversity of the contained logos, these datasets are not suitable for learning and validating automatic logo generators. At the same time, a number of web pages allow (paid) access to large numbers of icons, such as iconsdb.com (4135+ icons), icons8.com (59900+), iconfinder.com (7473+), iconarchive.com (450k+) and thenounproject.com (1m+). However, the diversity of these icons is limited by the number of sources, namely designers/artists, themes (categories) and design patterns (many are black-and-white icons). Therefore, we crawl a highly diverse dataset of real logos ‘in the wild’ from the Internet: the Large Logo Dataset (LLD). As shown in Table 1, our LLD offers thousands of times more distinct logos than the largest public logo dataset to date, WebLogo-2M.
In contrast to popularly used natural image datasets such as ImageNet, CIFAR-10 and LSUN, face datasets like CelebA, and the relatively easily modeled handwritten digits of MNIST, logos are: (1) artificial, yet strongly multimodal and thus challenging for generative models; (2) applied, as there is an obvious real-world demand for synthetically generated, unique logos, since they are expensive to produce; (3) hard to label, as there are very few categorical properties which manifest themselves in a logo’s visual appearance. While logos are easily obtainable in large quantities, they are specifically designed to be unique, which ensures the diversity of a large logo dataset. We argue that all these characteristics make logos a very attractive domain for machine learning research in general, and generative modeling in particular.
Recent advances in generative modeling have provided viable frameworks for making such a system possible. The current state of the art is made up mainly of two types of generative models, namely Variational Autoencoders (VAEs) [16, 19, 20] and Generative Adversarial Networks (GANs) [2, 10, 11]. Both of these models generate their images from a high-dimensional latent space that can act as a sort of “design space” in which a user is able to modify the output in a structured way. VAEs have the advantage of directly providing embeddings of any given image in the latent space, allowing targeted modifications to its reconstruction, but tend to suffer from blurry output owed to the nature of the pixel-wise loss used during training. GANs on the other hand, which consist of separate generator and discriminator networks trained simultaneously on opposing objectives in a competitive manner, are known to provide realistic-looking, crisp images, but are notoriously unstable to train. To address this difficulty, a number of improvements to GAN architectures and training methods have been suggested, such as using deep convolutional layers, or modified loss functions based e.g. on least squares or the Wasserstein distance between probability distributions [3, 4, 12].
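To make the contrast concrete, the original GAN minimax objective and the Wasserstein variant it is often replaced with can be written as:

```latex
% Standard GAN objective (Goodfellow et al.)
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

% WGAN objective: the discriminator (here a "critic") ranges over
% 1-Lipschitz functions, approximating the Wasserstein-1 distance
% between the real and generated distributions
\min_G \max_{\|D\|_L \le 1} \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[D(x)\big]
  - \mathbb{E}_{z \sim p_z}\big[D(G(z))\big]
```

In the gradient-penalty variant (iWGAN) used later in this paper, the Lipschitz constraint is enforced softly by penalizing deviations of the critic's gradient norm from 1.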
The first extension of GANs with class-conditional information followed shortly after their inception, generating MNIST digits conditioned on class labels provided to both generator and discriminator during training. It has since been shown on supervised datasets that class-conditional variants of generative networks very often produce superior results compared to their unconditional counterparts [12, 15, 26]. By adding an encoder that maps a real image into the latent space, it was shown to be feasible to generate a modified version of the original image by changing class attributes on faces [6, 27] and other natural images. Other notable applications include the generation of images from a high-level description, such as visual attributes or text descriptions.
In this work we train GANs on our own highly multi-modal logo data as a first step towards user-manipulated artificial logo synthesis. Our main contributions are:
LLD - a novel dataset of 600k+ logo images.
Methods to successfully train GAN models on multi-modal data. Our proposed clustered GAN training achieves state-of-the-art Inception scores on the CIFAR-10 dataset.
An exploration of GAN latent space for logo synthesis.
The remainder of this paper is structured as follows. We introduce a novel Large Logo Dataset (LLD) in Section 2. We describe the proposed clustered GAN training, the clustering methods, as well as the GAN architectures used and perform quantitative experiments in Section 3. Then we demonstrate logo synthesis by latent space exploration operations in Section 4. Finally, we draw the conclusions in Section 5.
2 LLD: Large Logo Dataset
In the following we introduce a novel dataset based on website logos, called the Large Logo Dataset (LLD). It is the largest logo dataset to date (see Table 1). The LLD dataset consists of two parts: a low-resolution (32×32 pixel) favicon subset (LLD-icon) and a higher-resolution (400×400 pixel) Twitter subset (LLD-logo). We briefly describe the acquisition, properties and possible use-cases for each. Both versions will be made available at https://data.vision.ee.ethz.ch/cvl/lld/.
2.1 LLD-icon: Favicons
For generative models like GANs, the difficulty of keeping the network stable during training increases with image resolution. Thus, when starting to work with a new type of data, it makes sense to start off with a variant which is inherently low-resolution. Luckily, the domain of logo images contains a category of such inherently low-resolution, low-complexity images: favicons, the small icons representing a website e.g. in browser tabs or favorites lists. We decided to crawl the web for such favicons using the largest resource of high-quality website URLs we could find: Alexa's top 1-million website list (now officially retired, formerly available at https://www.alexa.com). To this end we use the Python package Scrapy (https://scrapy.org/) in conjunction with our own download script, which directly converts all icons found to a standardized 32×32 pixel resolution and RGB color space, discarding all non-square images.
After acquiring the raw data from the web, we remove all exact duplicates (of which there is a surprisingly high number, almost 20%). Visual inspection of the raw data reveals a non-negligible number of images that do not comply with our initial dataset criteria and often are not even remotely logo-like, such as faces and other natural images. To get rid of this unwanted data, we (i) sort all images by PNG-compressed file size, an image-complexity indicator; (ii) manually inspect and partition the resulting sorted list into three sections: clean and mostly clean data, which are kept, and mostly unwanted data, which is discarded; (iii) discard the mostly clean images containing the least amount of white pixels.
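The deduplication and complexity-sorting steps can be sketched with the standard library alone. This is an illustrative stand-in for the actual pipeline (function names are ours, not from the paper's code), and zlib-compressed size stands in for PNG file size, since PNG uses DEFLATE internally:

```python
import zlib
import hashlib

def deduplicate(images):
    """Remove exact duplicates by hashing the raw image bytes."""
    seen, unique = set(), []
    for img in images:
        digest = hashlib.sha1(img).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(img)
    return unique

def complexity(img_bytes):
    """Compressed size as a crude image-complexity indicator
    (zlib standing in for PNG's internal DEFLATE compression)."""
    return len(zlib.compress(img_bytes, 9))

def sort_by_complexity(images):
    """Deduplicate, then order from simplest to most complex; the sorted
    list is what gets manually partitioned into clean / mostly clean /
    unwanted sections."""
    return sorted(deduplicate(images), key=complexity)
```

A uniform image compresses to a few bytes while a busy one barely shrinks, which is why the compressed size works as a cheap complexity proxy.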
The result of this process, a small sample of which is given in Figure 3, is a clean set of 486,377 images of uniform 32×32 pixel size, making it very easy to use. The disadvantage of this standardized size is that 54% of the images appear blurry because they were scaled up from a lower resolution. For this reason we will also provide (the indices for) a subset of the data containing only sharp images, which we will refer to as LLD-icon-sharp.
2.2 LLD-logo: Twitter
For training generative networks at an increased resolution, additional high-resolution data is needed, which favicons cannot provide. One option would be to crawl the respective websites directly and look for the website or company logo. However, (a) it is not always straightforward to find the logo and distinguish it from other images on the website, and (b) the aspect ratio and resolution of logos obtained in this way vary widely, which would necessitate extensive cropping and resizing, potentially degrading the quality of a large portion of the logos.
By crawling Twitter instead of websites, we are able to acquire standardized square 400×400 pixel profile images which can easily be downloaded through the Twitter API without the need for web scraping. We use the Python wrapper tweepy to search for the (sub-)domain names contained in the Alexa list and match the original URL with the website provided in the Twitter profile to make sure that we have found the right Twitter user. The images are then run through a face detector to reject personal Twitter accounts, and the remaining images are saved together with Twitter metadata such as user name, number of followers and description. For this part of the dataset, all original resolutions are kept as-is, where 80% are at 400×400 pixels and the rest at some lower resolution (details given in the supplementary material).
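The URL-matching step can be sketched as follows. This is a simplified stand-in for the crawler logic (the tweepy API calls are omitted, and the function names are illustrative): both the Alexa entry and the profile's website field are reduced to a bare host name before comparison.

```python
from urllib.parse import urlparse

def normalize_domain(url):
    """Reduce a URL or bare (sub-)domain string to a comparable host name."""
    if "//" not in url:
        url = "//" + url  # let urlparse treat a bare domain as a netloc
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host

def profile_matches(alexa_entry, profile_url):
    """True if the Twitter profile's website points at the crawled domain."""
    return normalize_domain(alexa_entry) == normalize_domain(profile_url)
```

This normalization ignores scheme, path and a leading "www.", which is enough to match "example.com" from the Alexa list against "https://www.example.com/" in a profile.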
The acquired images are analyzed and sorted with a combination of automatic and manual processing in order to remove unwanted and possibly sensitive images, resulting in 122,920 usable high-resolution logos of consistent quality with rich metadata from the respective Twitter accounts. These logo images form the LLD-logo dataset, a small sample of which is presented in Figure 4.
3 Clustered GAN Training
We propose a method for stabilizing GAN training and achieving superior quality samples on unlabeled datasets by means of clustering (a) in the latent space of an autoencoder trained on the same data or (b) in the CNN feature space of a ResNet classifier trained on ImageNet. With both methods we are able to produce semantically meaningful clusters that improve GAN training.
In this section we review the GAN architectures used in our study, describe the clustering methods based on the Autoencoder latent space and on ResNet features, and discuss the quantitative experimental results.
3.1 GAN architectures
Our generative models are based on the Deep Convolutional Generative Adversarial Network (DCGAN) of Radford et al. and the improved Wasserstein GAN with gradient penalty (iWGAN) proposed by Gulrajani et al.
For our DCGAN experiments, we use Taehoon Kim's TensorFlow implementation (https://github.com/carpedm20/DCGAN-tensorflow). We train DCGAN exclusively on the low-resolution LLD-icon subset, for which it proved to be inherently unstable without our clustering approach. We use the input blurring explained in the next section in all our DCGAN experiments. For details on the hyper-parameters used, we refer the interested reader to the supplementary material.
All our iWGAN experiments are based on the official TensorFlow repository by Gulrajani et al. (https://github.com/igul222/improved_wgan_training). We keep the default settings as provided by the authors and exclusively use the 32- and 64-pixel ResNet architectures provided in the repository, with the only major modification being our conditioning method as described below. We also use linear learning rate decay (from the initial value to zero over the full number of training iterations) for all our experiments.
As mentioned in the introduction (Section 1), training a conditional GAN with labels is beneficial in terms of output quality compared to an unsupervised setting. In particular, we found DCGAN to be unstable on our icon dataset (LLD-icon) for resolutions higher than 10×10, and were able to stabilize it by introducing synthetic labels as described in this section. In addition to stabilizing GAN training, we achieve state-of-the-art Inception scores (as proposed by Salimans et al.) on CIFAR-10 using iWGAN with our synthetic labels produced by RC clustering, and thus provide quantitative evidence of a quality improvement in Section 3.4. Furthermore, the cluster labels give us additional control over the generated logos: we can generate samples from individual clusters or transform a particular logo to inherit the specific attributes of another cluster, as demonstrated in Section 4.
3.2 Clustering Methods
AE: Autoencoder Clustering
Our first proposed method for producing synthetic data labels is clustering in the latent space of an Autoencoder. We construct an Autoencoder consisting of a modified version of the GAN discriminator, with a latent-space output instead of a single one, acting as an encoder, and the unmodified GAN generator acting as a decoder for the reconstruction of the image from the latent representation, as illustrated in Figure 5. This Autoencoder is trained using a simple pixel-wise loss between the original and reconstructed image. All images are then encoded to latent vectors, followed by a PCA dimensionality reduction, and finally clustered using (mini-batch) k-means. For our logo data, this produces clusters that are both semantically meaningful, as they are based on high-level AE features, and recognizable by the GAN, because they were created using the same general network topology.
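Assuming the encoder outputs are collected row-wise in a matrix, the PCA-plus-k-means step might look like this (a scikit-learn sketch; the dimensionalities and cluster counts here are illustrative, not the paper's exact settings):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

def cluster_latents(latents, n_components=16, n_clusters=8, seed=0):
    """Reduce encoder outputs with PCA, then assign synthetic cluster
    labels with mini-batch k-means.

    latents: array of shape (n_samples, latent_dim)
    returns: integer labels of shape (n_samples,)
    """
    reduced = PCA(n_components=n_components, random_state=seed).fit_transform(latents)
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed, n_init=3)
    return km.fit_predict(reduced)
```

The same function serves the RC variant described next; only the input changes from Autoencoder latents to ResNet feature vectors.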
RC: ResNet Classifier Clustering
For our second clustering method we leverage the learned features of an ImageNet classifier, namely ResNet-50 by He et al. We feed our images to the classifier and extract the output of the final pooling layer to obtain a 2048-dimensional feature vector per image. After a PCA dimensionality reduction we cluster our data in this feature space with (mini-batch) k-means. The obtained clusters are considerably superior to those produced with our AE clustering method on CIFAR-10, where one could argue that we benefit from the similarity in categories between ImageNet and CIFAR-10, and are thus indirectly using labeled data. However, the clustering is also very meaningful on our logo dataset, which has very different content and does not consist of natural images like ImageNet, demonstrating the generality of this approach.
3.3 Conditional GAN Training Methods
In this section we describe the conditional GAN models used to leverage our synthetic data labels and the input blurring applied to DCGAN.
LC: Layer Conditional GAN
In our layer-conditional models, the cluster label for each training sample is fed to all convolutional and linear layers of both generator and discriminator. For linear layers it is simply appended to the input as a one-hot vector. For convolutional layers, the labels are projected onto “one-hot feature maps” with as many channels as there are clusters, where the channel corresponding to the cluster number is filled with ones while the rest are zero. These additional feature maps are appended to the input of every convolutional layer, such that every layer can directly access the label information. This is illustrated in Figure 7 for DCGAN and in Figure 6 for the ResNet used in our iWGAN model. Even though the labels are provided to every layer, there is no explicit mechanism forcing the network to use this information: if the labels are random or meaningless, they can simply be ignored by the network. However, as soon as the discriminator starts adjusting its criteria for each cluster, it forces the generator to produce images that comply with the different requirements for each class. Our experiments confirm that visually meaningful clusters are always picked up by the model, while the network simply falls back to the unconditional state for random labels. This type of class conditioning has some useful properties, such as the ability to interpolate between different classes, and is less prone to failure in producing class-conditional samples than the AC conditioning described below. However, it does come with the drawback of adding a significant number of parameters, especially to low-resolution networks, when there is a large number of classes. This effect diminishes with larger networks containing more feature maps, as the number of added parameters remains constant.
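The one-hot feature-map construction can be illustrated in NumPy (NCHW layout; in the actual networks this concatenation happens inside the TensorFlow graph, and the function name is ours):

```python
import numpy as np

def append_onehot_maps(feature_maps, labels, n_classes):
    """Append per-class one-hot feature maps to an activation tensor.

    feature_maps: array of shape (N, C, H, W)
    labels:       integer cluster labels of shape (N,)
    returns:      array of shape (N, C + n_classes, H, W)
    """
    n, _, h, w = feature_maps.shape
    maps = np.zeros((n, n_classes, h, w), dtype=feature_maps.dtype)
    # fill the entire map of each sample's cluster with ones
    maps[np.arange(n), labels] = 1.0
    return np.concatenate([feature_maps, maps], axis=1)
```

Because the extra channels depend only on the number of clusters, the parameter overhead of the first convolution after each concatenation is fixed, which is why it weighs less in larger networks.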
AC: Auxiliary Classifier GAN
With iWGAN we also use the Auxiliary Classifier proposed by Odena et al. as implemented by Gulrajani et al. While this method does not allow us to interpolate between clusters and is thus slightly more limited from an application perspective, it avoids adding parameters to the convolutional layers, which in general results in a network with fewer parameters. iWGAN-AC was our method of choice for CIFAR-10, as it delivers the highest Inception scores.
Input Blurring
During our experiments we noticed that blurring the input image helps the network remain stable during training, which led us to apply a Gaussian blur to all images presented to the discriminator (training data as well as samples from the generator), as previously implemented by Susmelj et al. The method is schematically illustrated in Figure 8. Upscaling the images to 64×64 pixel resolution before convolving them with the Gaussian kernel enables us to train with blurred images while preserving almost all of the image's sharpness when scaled back down to the original resolution of 32×32 pixels. When generating image samples from the trained generator without applying the blur filter, there is some noticeable noise in the images, which becomes imperceptible after resizing to the original data resolution, producing almost perfectly sharp output images. Based on our experimental experience, we believe this blurring to produce higher-quality samples and to help stability; it is, however, not strictly necessary for stability with DCGAN when using clustered training.
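The upscale-blur-downscale round trip can be sketched in plain NumPy for a single channel (kernel width and sigma here are illustrative choices, not the paper's exact values):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel of width 2*radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def blur_upscaled(img, sigma=1.0):
    """Nearest-neighbor upscale 32x32 -> 64x64, then separable Gaussian blur.
    This is what the discriminator would see during training."""
    up = img.repeat(2, axis=0).repeat(2, axis=1)
    k = gaussian_kernel(sigma, radius=3)
    pad = 3
    p = np.pad(up, pad, mode="reflect")
    # convolve each column, then each row, with the 1-D kernel
    p = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 0, p)
    p = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, p)
    return p[pad:-pad, pad:-pad]

def downscale(img):
    """Scale back to the original resolution by 2x2 averaging; most of the
    blur's effect disappears at this resolution."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```

Because the Gaussian is applied at twice the working resolution, its effective footprint after downscaling is halved, which is why the 32×32 result stays nearly sharp.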
3.4 Quantitative evaluation and state-of-the-art
In order to quantitatively assess the performance of our solutions on the commonly used CIFAR-10 dataset, we report Inception scores and diversity scores based on MS-SSIM, as suggested in prior work, over a set of 50,000 randomly generated images. In Table 2 we summarize results for different configurations in supervised (using CIFAR class labels) and unsupervised settings in LC and AC conditional modes, including reported scores from the literature.
| Method | Clusters | Inception score | Diversity (MS-SSIM) |
|---|---|---|---|
| iWGAN-LC with AE clustering | 32 | 7.300 ± 0.072 | 0.0507 ± 0.0016 |
| iWGAN-LC with RC clustering | 32 | 7.831 ± 0.072 | 0.0491 ± 0.0015 |
| iWGAN-LC with RC clustering | 128 | 7.799 ± 0.030 | 0.0491 ± 0.0015 |
| iWGAN-AC with AE clustering | 32 | 7.885 ± 0.083 | 0.0504 ± 0.0014 |
| iWGAN-AC with RC clustering | 10 | 8.433 ± 0.068 | 0.0505 ± 0.0016 |
| iWGAN-AC with RC clustering | 32 | 8.673 ± 0.075 | 0.0500 ± 0.0016 |
| iWGAN-AC with RC clustering | 128 | 8.625 ± 0.109 | 0.0465 ± 0.0015 |
| CIFAR-10 (original data) | – | 11.237 ± 0.116 | 0.0485 ± 0.0016 |
| Method | Clusters | CORNIA score | Diversity (MS-SSIM) |
|---|---|---|---|
| DCGAN-LC with AE clustering | 100 | 62.12 ± 0.51 | 0.0475 ± 0.0013 |
| iWGAN-LC with AE clustering | 100 | 60.24 ± 0.61 | 0.0439 ± 0.0010 |
| *iWGAN-LC with RC clustering | 16 | 55.37 ± 0.67 | 0.0490 ± 0.0014 |
| *iWGAN-LC with RC clustering | 128 | 55.27 ± 0.68 | 0.0484 ± 0.0010 |
| LLD-icon (original data) | – | 61.00 ± 0.62 | 0.0482 ± 0.0014 |
| *LLD-icon-sharp (original data) | – | 55.37 ± 0.67 | 0.0494 ± 0.0011 |
On CIFAR-10, increasing the number of RC clusters from 1 to 128 leads to better diversity scores for iWGAN-AC, while the Inception score peaks at 32 clusters. We note that RC clustering leads to better performance than AE clustering.
Performance and state-of-the-art
Our best Inception score of 8.673, achieved with iWGAN-AC and 32 RC clusters, is significantly higher than the score reported by Salimans et al. for their Improved GAN method, the best previously reported in the literature for unsupervised methods. Surprisingly, this result, achieved with unsupervised synthetic labels provided by RC clustering, is comparable to that of the Stacked GANs approach by Huang et al., the best score reported for supervised methods.
Complementary to the Inception and diversity scores, we also measure image quality using CORNIA, a robust no-reference image quality assessment method proposed by Ye and Doermann. On both CIFAR-10 and LLD-icon, our generative models obtain CORNIA scores equivalent to those of the original images from each dataset. This result is in line with prior findings, where the studied GANs also converge in terms of CORNIA scores towards the image quality of the data. We show the CORNIA and MS-SSIM scores for the LLD-icon dataset, as a complement to the Inception scores on CIFAR-10, in Table 3.
LC vs. AC for conditional GANs
Our AC-GAN variants are better than their LC counterparts in terms of Inception scores, but comparable in terms of diversity on CIFAR-10. We believe this is owed to the fact that AC-GAN enforces the generation of images which can easily be classified into the provided clusters, which in turn could raise the classifier-based Inception score. Even though the numbers indicate a qualitative advantage of AC- over LC-GAN, we prefer the latter for our logo application, as it allows smooth interpolations even between different clusters. This is not possible in the standard AC-GAN implementation, since the cluster labels are discrete integer values; all our desirable latent space operations would thus be constrained to a specific data cluster, which does not match our intended use.
4 Logo synthesis by latent space exploration
As mentioned in the previous section, layer conditioning allows for smooth transitions in the latent space from one class to another, which is critical for logo synthesis and manipulation by exploration of the latent space. Therefore, we work with two configurations for these experiments: iWGAN-LC with 128 RC clusters and DCGAN-LC with 100 AE clusters. Their Inception, diversity and CORNIA scores are comparable on the LLD-icon dataset.
4.1 Latent Space Sampling
In GAN models, images are generated from a high-dimensional latent vector (usually with somewhere between 50 and 1000 dimensions), commonly referred to as the z-vector. During training, each component of this vector is randomly sampled from a uniform or Gaussian distribution, so that the generator learns to produce a reasonable output for any random vector sampled from the same distribution. The space spanned by these latent vectors, called the latent space, is often highly structured, such that latent vectors can be deliberately manipulated in order to achieve certain properties in the output [6, 8, 28].
Using DCGAN-LC with 100 AE clusters on the same data, Figure 9 contains samples from a specific cluster next to a sample of the respective original data. This shows how the layer conditional DCGAN is able to pick up on the data distribution and produce samples which are very easy to attribute to the corresponding cluster and are often hard to distinguish from the originals at first glance. For comparison we also show results for iWGAN-LC with 128 RC clusters trained on the LLD-icon-sharp dataset in Figure 1.
4.2 Interpolation
To show that a generator does not simply reproduce samples from the training set, but is in fact able to produce smooth variations of its output images, it is common practice to perform interpolations between two points in the latent space and to show that the outcome is a smooth transition between the two corresponding generated images, with all intermediate images exhibiting the same distribution and quality. Interpolation also provides an effective tool for a logo generator application, as the output image can be manipulated in a controlled manner towards a certain (semantically meaningful) direction in latent space.
For all our interpolation experiments we use the distribution-matching methods of Agustsson et al. in order to preserve the prior distribution the sampled model was trained on. An example with 64 interpolation steps showcasing the smoothness of such an interpolation is given in Figure 10, where we interpolate between 4 sample points, producing believable logos at every step. As this example shows, the interpolation works very well even between logos of different clusters, even though the generator was never trained on mixed cluster attributes.
Some more interpolations between different logos both within a single cluster and between logos of different clusters are shown in Figure 11, this time between 2 endpoints and with only 8 interpolation steps.
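In its simplest form, such an interpolation is a linear blend of the two endpoint z-vectors. The sketch below shows that plain variant for illustration only; the experiments above instead use distribution-preserving operations, since naive linear interpolation distorts the latent prior the generator was trained on:

```python
import numpy as np

def interpolate(z0, z1, steps=8):
    """Linearly interpolate between two latent vectors, endpoints included.

    Returns an array of shape (steps, latent_dim); each row is fed to the
    generator to render one frame of the transition.
    """
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z0 + t * z1
```

For layer-conditional models, the one-hot cluster vectors of the two endpoints can be interpolated with the same formula, which is what makes cross-cluster transitions possible.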
4.3 Class transfer
As the one-hot class vector representing the logo cluster is separate from our latent vector, it is also possible to keep the latent representation constant and only change the cluster of a generated logo. Figure 12 contains 11 logos (top row), which are transformed to a particular cluster class in each subsequent row. This shows how general appearance attributes such as color and content are encoded in the z-vector, while the cluster label transforms these attributes into a form that conforms with the contents of the respective cluster. Here, again, interpolation could be used to create intermediate versions as desired.
4.4 Vicinity sampling
Another powerful tool for exploring the latent space is vicinity sampling, where we perturb a given sample in random directions of the latent space. This could be used to present the user of a logo generator application with a choice of possible variants, allowing them to modify their logo step by step in directions of their choice. In Figure 13 we present an example of a 2-step vicinity sampling process, where we interpolate one-third of the way towards random samples to produce a succession of logo variants.
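The one-third step towards random samples can be sketched as follows (an illustrative function, not the paper's code; the fraction and variant count are parameters):

```python
import numpy as np

def vicinity_samples(z, n_variants=8, step=1.0 / 3.0, seed=0):
    """Move a latent vector a fixed fraction of the way towards freshly
    sampled random latent vectors, yielding nearby logo variants.

    z: latent vector of shape (latent_dim,)
    returns: array of shape (n_variants, latent_dim)
    """
    rng = np.random.default_rng(seed)
    targets = rng.standard_normal((n_variants, z.shape[0]))
    return z + step * (targets - z)
```

Repeating the process on a chosen variant gives the multi-step refinement shown in the figure: each round keeps the logo recognizable while offering controlled variation.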
4.5 Vector arithmetic example: Sharpening
For models trained on our LLD-icon data, some of the generated icons are blurry, since roughly half of the logos in this dataset are upscaled from a lower resolution. However, by subtracting the average z-vector of a number of blurry samples from the average of a number of sharp samples, it is possible to construct a “sharpening” vector which can be added to blurry logos to transform them into sharp ones. This works very well even if the directional vector is calculated exclusively from samples in one cluster and then applied to samples of another, showing that the blurriness is in fact nothing more than a feature embedded in latent space. The result of such a transformation is shown in Figure 14, where a sharpening vector was calculated from 40 sharp and 42 blurry samples manually selected from two random batches of the same cluster. The resulting vector is then applied equally to all blurry samples. The quality of the result, while already visually convincing, could be further optimized by adding individually adjusted fractions of this sharpening vector to each logo.
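The sharpening arithmetic reduces to a difference of group means in latent space (a sketch; function names are illustrative and the "strength" parameter corresponds to the individually adjusted fractions mentioned above):

```python
import numpy as np

def attribute_vector(z_with, z_without):
    """Directional vector for an attribute, e.g. mean(sharp) - mean(blurry).

    z_with, z_without: arrays of shape (n_samples, latent_dim) holding the
    z-vectors of samples that do / do not show the attribute.
    """
    return z_with.mean(axis=0) - z_without.mean(axis=0)

def apply_attribute(z, direction, strength=1.0):
    """Shift latent vectors along the attribute direction."""
    return z + strength * direction
```

The same two functions cover any latent-space attribute edit of this kind, e.g. directed manipulation of form or color.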
This example of adding a sharpening vector to the latent representation is only one of many latent space operations one could think of, such as directed manipulation of form and color as performed in the supplementary material.
5 Conclusions
In this paper we tackled the problem of logo design by synthesis and manipulation with generative models:
We introduced the Large Logo Dataset (LLD), crawled from the Internet, with orders of magnitude more logos than existing datasets.
In order to cope with the high multi-modality of such data and to stabilize GAN training on it, we proposed clustered GANs, that is, GANs conditioned on synthetic labels obtained through clustering. We performed clustering in the latent space of an Autoencoder or in the CNN feature space of a ResNet classifier, and conditioned DCGAN and improved WGAN models using either an Auxiliary Classifier or a Layer Conditional model.
We quantitatively validated our clustered GAN approaches on the CIFAR-10 benchmark, where we set a clear state-of-the-art Inception score for unsupervised generative models, showcasing the benefits of meaningful synthetic labels obtained through clustering in the CNN feature space of an ImageNet classifier.
We showed that the latent space of the networks trained on our logo data is smooth and highly structured, and thus exhibits interesting properties that can be exploited by vector arithmetic in that space.
We showed that the synthesis and manipulation of (virtually) infinitely many variations of logos is possible through latent space exploration equipped with a number of operations such as interpolations, sampling, class transfer or vector arithmetic in latent space like our sharpening example.
Our solutions ease the logo design task in an interactive manner and are significant steps towards a fully automatic logo design system.
For more results, operations, and settings the reader is invited to consult the supplementary material.
-  E. Agustsson, A. Sage, R. Timofte, and L. Van Gool. Optimal transport maps for distribution preserving operations on latent spaces of generative models. arXiv preprint arXiv:1711.01970, 2017.
-  M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
-  D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
-  F. Bordes, S. Honari, and P. Vincent. Learning to generate samples from noise through infusion training. arXiv preprint arXiv:1703.06975, 2017.
-  A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.
-  Z. Dai, A. Almahairi, P. Bachman, E. Hovy, and A. Courville. Calibrating energy-based generative adversarial networks. arXiv preprint arXiv:1702.01691, 2017.
-  A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1546, 2015.
-  V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
-  I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  S. C. Hoi, X. Wu, H. Liu, Y. Wu, H. Wang, H. Xue, and Q. Wu. Large-scale deep logo detection and brand recognition with deep region-based convolutional networks. arXiv preprint, 2015.
-  X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. arXiv preprint arXiv:1612.04357, 2016.
-  D. Jimenez Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
-  A. Joly and O. Buisson. Logo retrieval with a contrario visual query expansion. In Proceedings of the 17th ACM international conference on Multimedia, pages 581–584. ACM, 2009.
-  Y. Kalantidis, L. G. Pueyo, M. Trevisiol, R. van Zwol, and Y. Avrithis. Scalable triangulation-based logo recognition. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, page 20. ACM, 2011.
-  D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
-  D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
-  X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076, 2016.
-  M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
-  A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier gans. In International Conference on Machine Learning, 2017.
-  G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355, 2016.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1060–1069, New York, New York, USA, 20–22 Jun 2016. PMLR.
-  S. Romberg, L. G. Pueyo, R. Lienhart, and R. Van Zwol. Scalable logo recognition in real-world images. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, page 25. ACM, 2011.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  H. Sahbi, L. Ballan, G. Serra, and A. Del Bimbo. Context-dependent logo matching and recognition. IEEE Transactions on Image Processing, 22(3):1018–1031, 2013.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
-  H. Su, S. Gong, and X. Zhu. Weblogo-2m: Scalable logo detection by deep learning from the web. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
-  I. Susmelj, E. Agustsson, and R. Timofte. ABC-GAN: Adaptive blur and control for improved training stability of generative adversarial networks. International Conference on Machine Learning (ICML 2017) Workshop on Implicit Models, 2017.
-  A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
-  Z. Wang, E. P. Simoncelli, and A. C. Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, volume 2, pages 1398–1402, Nov 2003.
-  D. Warde-Farley and Y. Bengio. Improving generative adversarial networks with denoising feature matching. In International Conference on Learning Representations, 2017.
-  X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2Image: Conditional Image Generation from Visual Attributes, pages 776–791. Springer International Publishing, Cham, 2016.
-  P. Ye and D. Doermann. No-reference image quality assessment using visual codebooks. IEEE Transactions on Image Processing, 21(7):3129–3138, 2012.
-  F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
-  G. Zhu and D. Doermann. Automatic document logo detection. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), volume 2, pages 864–868. IEEE, 2007.
The following pages contain the supplementary material for this paper.
After presenting some latent space exploration experiments with LLD-logo in Section A, we give additional details on the data collection process as well as the final contents of our LLD datasets in Section B. In Section C we then show, for each subset of our Large Logo Dataset, an excerpt of the collected data together with samples generated by selected GAN architectures and the clusters produced by the applied clustering methods. For CIFAR-10 we also show samples from our cluster-conditional CIFAR-10 models together with samples from the unconditional and supervised iWGAN variants in this section. Finally, we give some details on the architecture and training hyper-parameters of our models in Section D.
Appendix A Latent space exploration on LLD-logo
In this section, we present some interpolations on the LLD-logo dataset and perform two additional experiments with latent space operations.
In Figure 15 we present two examples of interpolations between 4 different samples, representing a small section of the high-dimensional logo manifold created by the GAN.
First, we define two semantic operations we would like to perform: (1) color shifts from red to blue and from blue to red, and (2) shape changes from square to round and from round to square. For each of these operations we identify a number of samples (around 30 in our experiments) that match our criteria; for operation (1) this means selecting 30 red and 30 blue logos. We then construct a directional vector by subtracting the mean latent space vector of all red logos from the mean latent space vector of all blue logos, which gives us a directional vector pointing from red to blue. Since some of these semantic attributes are expected to be encoded in the cluster labels as well, we can do the same with our one-hot encoded class vectors, which we can view as an additional cluster space. In Figure 16 we add this directional vector to a new random batch of generated logos; if we subtract the directional vector instead, we get a shift in the opposite direction, i.e. from blue to red. To find out how much of the color information is encoded in the latent representation and in the clusters respectively, we can perform the operation in only one of these domains. This is done in Figure 17 for the red shift, where we observe very similar behavior for both spaces, indicating that the color information is encoded equally in the latent space and the labels.
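The directional-vector construction described above can be sketched in a few lines. This is a minimal NumPy sketch with hypothetical latent codes; the 512-dimensional latent size and the group sizes are illustrative, not taken from a trained model:

```python
import numpy as np

def directional_vector(z_source, z_target):
    """Directional vector in latent space pointing from the mean of one
    semantic group to the mean of another (e.g. red -> blue)."""
    return z_target.mean(axis=0) - z_source.mean(axis=0)

# Hypothetical latent codes of ~30 red and ~30 blue logos (512-D each).
rng = np.random.default_rng(0)
z_red = rng.normal(size=(30, 512))
z_blue = rng.normal(size=(30, 512))

d = directional_vector(z_red, z_blue)  # points from red towards blue

# Shift a fresh batch of latent codes towards "blue"; subtracting d
# instead would shift it towards "red".
z_new = rng.normal(size=(8, 512))
z_shifted = z_new + d
```

The same subtraction can be applied to the one-hot cluster vectors, treating the label space as a second, lower-dimensional latent space.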
Our second experiment is performed in the same way, and the directional vector is applied to the same batch of samples. Figure 18, again, shows the result for a simultaneous addition of both (latent and class) vectors in each direction, whereas each space is considered individually in Figure 19 for the directional vectors towards round logos. Here we can observe that some logos respond better to the change in latent space, while others seem more responsive to a changing cluster label. Overall, the label information seems to be a little stronger in this case.
In both experiments, the combined shift clearly performs best, and could provide a powerful tool for logo manipulation and other applications.
Appendix B LLD crawling and image statistics
When collecting the favicons for LLD-icon, our download script directly converted all icons found to a standardized 32×32 pixel resolution and RGB color space, discarding all non-square images. After acquiring the raw data from the web, we remove all exact duplicates and perform a three-stage clean-up process:
1. Sort all images by complexity, evaluated as their PNG-compressed file size.
2. Manually inspect the resulting sorted list and partition it into three sections: clean, mostly clean, and mostly unwanted data. The last section is discarded, while the middle (mostly clean) section is further processed in the next step.
3. Sort the intermediate section by the number of white pixels in each image and, after inspection, cut off at a certain point, discarding the images containing the fewest white pixels.
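The two sorting criteria above can be sketched as follows. This is an illustrative NumPy sketch; zlib compression of the raw pixel buffer stands in for the PNG file size (PNG also uses DEFLATE), and all names are hypothetical:

```python
import zlib
import numpy as np

def complexity(arr: np.ndarray) -> int:
    """zlib-compressed byte size of the raw pixel buffer, a cheap
    stand-in for the PNG file size used as a complexity proxy."""
    return len(zlib.compress(arr.tobytes()))

def white_pixel_count(arr: np.ndarray, threshold: int = 250) -> int:
    """Number of (near-)white pixels in an HxWx3 uint8 image."""
    return int(np.all(arr >= threshold, axis=-1).sum())

# A plain white icon compresses much smaller than a noisy one.
flat = np.full((32, 32, 3), 255, dtype=np.uint8)
noisy = np.random.default_rng(0).integers(0, 256, (32, 32, 3), dtype=np.uint8)

by_complexity = sorted([noisy, flat], key=complexity)        # stage 1
by_white = sorted([noisy, flat], key=white_pixel_count)      # stage 3
```

The manual partitioning of stage 2 happens between the two sorts and cannot be automated here.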
Table 4 shows statistics on the crawling process, the original image resolutions the icons were rescaled from, and the amount of content removed through our clean-up process.
During the collection of LLD-logo on Twitter, we use a face detector and proceed to the next user in the search results whenever a face is detected. At the same time, we make use of Twitter's (relatively new) sensitive-content flag to reject flagged profiles. As the number of rejected profiles in Table 5, compared to the number of images discarded during cleanup (a substantial number of which were due to sensitive content), shows, this flag is currently used only sporadically and is far from a reliable indicator. Figure 20 shows a histogram of the image resolutions contained in LLD-logo (where no re-scaling was performed during data collection), with the top-5 image resolutions (amounting to 92% of all images) given in Table 6.
[Table 4 – favicon crawling statistics; rows: Unable to process, Total images saved, Native 32 px, Discarded due to content, Clean dataset size]
[Table 5 – LLD-logo collection statistics; rows: Flagged content ignored, Discarded during cleanup, Final dataset size]
[Table 6 – columns: Image height (px), Number of images, % of total]
Appendix C Logo Data, clusters and generated samples
In this section, we show a small sample from each of our introduced datasets and present icons generated by models trained on them. Additionally, we show the data clusters produced by our clustering methods.
Starting with LLD-logo, Figure 21 shows a sample of the original data collected (reduced to 64×64 pixels) next to the logos generated by an iWGAN model trained at 64×64 pixels. Compared to LLD-icon, these logos contain a lot more text and sometimes more detailed images. Both of these features are recreated nicely by the model, where the text is often (but not always) illegible while still of a realistic appearance. We would expect the legibility of the text to be much higher if our data did not contain a large number of non-Latin (e.g. Chinese) characters. Figure 22 contains the 64 clusters found by clustering with our RC method, showing very obvious semantic similarities within each cluster. It is not immediately noticeable that each block is composed of real (top half) and generated (bottom half) samples, which shows how well the GAN is able to reproduce the specific distributions inherent in each cluster.
In a similar way, Figures 23 and 24 present samples from LLD-icon and LLD-icon-sharp, respectively. Here we compare random samples from different trained models, containing both conditional and unconditional variants. Figures 25 and 26 show the clusters found in LLD-icon by clustering in the latent space of an Autoencoder, while Figures 27 and 28 show the clusters found in LLD-icon-sharp using the feature space of a ResNet classifier. A very noticeable difference originates from the fact that the Autoencoder was trained on gray-scale images and is thus relatively color-independent, whereas the RC version exhibits some very apparent single-color clusters, mostly containing green, blue or orange/red logos.
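The clustering step itself can be illustrated with a minimal k-means over feature vectors. This NumPy sketch uses a deterministic, spread-out initialization and synthetic blobs in place of the actual Autoencoder latents or ResNet features:

```python
import numpy as np

def kmeans(feats, k, iters=20):
    """Minimal k-means on an (N, D) feature matrix, a sketch of the
    clustering step (the paper clusters CNN feature vectors)."""
    # deterministic spread-out initialization over the data
    centers = feats[np.linspace(0, len(feats) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # squared distance of every point to every center: (N, k)
        d = ((feats[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            members = feats[labels == j]
            if len(members):
                centers[j] = members.mean(0)
    return labels, centers

# Two well-separated synthetic "feature" blobs as stand-in data.
feats = np.vstack([np.zeros((10, 2)), np.full((10, 2), 10.0)])
labels, centers = kmeans(feats, k=2)
```

The resulting labels then serve as synthetic classes for the conditional GAN variants.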
Finally, in Figure 29, we present some samples from our benchmarked CIFAR-10 generators, together with the achieved Inception scores. Figures 30 and 31 compare the clusters found using our RC method with the original data labels, with noticeably more visually uniform classes resulting from our synthetic labeling technique.
Appendix D Architecture Details
In this section we specify the exact architectures and hyper-parameters used to train our models.
iWGAN for 32×32-pixel output
We use the residual network architecture designed for CIFAR-10 described by Gulrajani et al. (Appendix C of their paper) for this model. For iWGAN-LC, the input of each stage is extended by n_c additional channels, where n_c is the number of classes, i.e. the number of cluster centers used in our clustering approach. All training hyper-parameters remain untouched, and we never use normalization in the Discriminator, as this resulted in consistently superior Inception scores in our CIFAR-10 experiments. We use the exact same model and training parameters with our LLD-icon dataset.
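Extending each stage's input by the cluster label can be sketched as appending broadcast one-hot channels. This is an illustrative NumPy sketch of the layer-conditioning idea (the real models operate on TensorFlow tensors; all names here are hypothetical):

```python
import numpy as np

def concat_onehot(features, labels, n_classes):
    """Append one-hot cluster labels, broadcast over the spatial
    dimensions, as extra channels of an NHWC feature map."""
    n, h, w, _ = features.shape
    onehot = np.eye(n_classes)[labels]                  # (n, n_classes)
    maps = np.broadcast_to(onehot[:, None, None, :],
                           (n, h, w, n_classes))
    return np.concatenate([features, maps], axis=-1)

# 4 feature maps of shape 8x8x16, conditioned on 4 cluster labels.
x = np.zeros((4, 8, 8, 16))
y = np.array([0, 2, 1, 3])
out = concat_onehot(x, y, n_classes=4)   # shape (4, 8, 8, 16 + 4)
```

Applying this at every stage is what grows each input by n_c channels relative to the unconditional model.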
iWGAN for 64x64-pixel output
For LLD-logo at 64×64 pixels we again use the official TensorFlow implementation by Gulrajani et al. (https://github.com/igul222/improved_wgan_training). Again, the input of each stage is extended from its size in the original model by n_c additional channels, where n_c is the number of classes. The only changes we made are a reduced number of training iterations and linearly decaying the learning rate over these iterations.
For DCGAN, we deviate from some of the hyper-parameters used in Taehoon Kim's TensorFlow implementation (https://github.com/carpedm20/DCGAN-tensorflow), namely:
Higher number of feature maps: (128+n_c, 256+n_c, 512+n_c, 1024+n_c) for the Discriminator layers and (256+n_c, 512+n_c, 1024+n_c, 2048+n_c) for the Generator layers, with n_c again being the number of classes in the LC version.
For each training iteration of the Discriminator, we train the Generator 3 times
Reduced learning rate of 0.0004 (default: 0.002)
Higher latent space dimensionality of 512 components (default: 100)
Blur input images to Discriminator as detailed in Section 3.3 of our paper.
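The input blurring can be sketched with a simple spatial smoothing of each batch before the Discriminator sees it. This NumPy sketch uses a box blur as an illustrative stand-in for the Gaussian blur schedule of Section 3.3 (in training, the blur strength is annealed over time):

```python
import numpy as np

def box_blur(images, k=3):
    """Box-blur the spatial dims of an NHWC batch, an illustrative
    stand-in for the annealed Gaussian blur applied to Discriminator
    inputs."""
    pad = k // 2
    # pad spatial dims with edge values, then average a k x k window
    p = np.pad(images, ((0, 0), (pad, pad), (pad, pad), (0, 0)),
               mode="edge")
    out = np.zeros_like(images, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += p[:, dy:dy + images.shape[1],
                     dx:dx + images.shape[2], :]
    return out / (k * k)

batch = np.random.default_rng(0).random((2, 32, 32, 3))
blurred = box_blur(batch, k=3)
```

Smoothing high-frequency detail early in training is what stabilizes the adversarial game; the blur is then gradually removed.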