Logo Synthesis and Manipulation with Clustered Generative Adversarial Networks

12/12/2017 ∙ by Alexander Sage, et al. ∙ ETH Zurich

Designing a logo for a new brand is a lengthy and tedious back-and-forth process between a designer and a client. In this paper we explore to what extent machine learning can solve the creative task of the designer. For this, we build a dataset -- LLD -- of 600k+ logos crawled from the world wide web. Training Generative Adversarial Networks (GANs) for logo synthesis on such multi-modal data is not straightforward and results in mode collapse for some state-of-the-art methods. We propose the use of synthetic labels obtained through clustering to disentangle and stabilize GAN training. We are able to generate a high diversity of plausible logos and we demonstrate latent space exploration techniques to ease the logo design task in an interactive manner. Moreover, we validate the proposed clustered GAN training on CIFAR-10, achieving state-of-the-art Inception scores when using synthetic labels obtained via clustering the features of an ImageNet classifier. GANs can cope with multi-modal data by means of synthetic labels achieved through clustering, and our results show the creative potential of such techniques for logo synthesis and manipulation. Our dataset and models will be made publicly available at https://data.vision.ee.ethz.ch/cvl/lld/.




1 Introduction and related work

Logo design

Designing a logo for a new brand is usually a lengthy and tedious process, both for the client and the designer. Many ultimately unused drafts are produced, from which the client selects their favorites, followed by multiple cycles of refining the logo to match the client's needs and wishes. Especially for clients without a specific idea of the end product, this results in a procedure that is not only time-consuming but also costly.

The goal of this work is to provide a framework towards a system with the ability to generate (virtually) infinitely many variations of logos (some examples are shown in Figure 1) to facilitate and expedite such a process. To this end, the prospective client should be able to modify a prototype logo according to specific parameters like shape and color, or shift it a certain amount towards the characteristics of another prototype. An example interface for such a system is presented in Figure 2. It could help both designer and client to get an idea of a potential logo, which the designer could then build upon, even if the system itself was not (yet) able to output production-quality designs.

Figure 1: Original and generated images from four selected clusters from our LLD-icon-sharp dataset. The top three rows consist of original logos, followed by logos generated using our iWGAN-LC trained on 128 RC clusters.
Figure 2: Logo generator interface. The user is able to choose either vicinity sampling or class transfer to modify the image in a chosen semantic direction. For both methods, 8 random variations are arranged around the current logo. Upon selecting the appropriate sample, the current logo can be modified by a variable amount using the slider at the bottom of the window. After confirming the selected modification, the process starts over again from the newly modified logo, until the desired appearance is reached. In addition to vicinity sampling within or across clusters, some pre-defined semantic modifications can be made using the sliders on the right-hand side of the first view. The images used here are generated with iWGAN-LC trained at 64×64 pixels on LLD-logo clustered into 64 different classes, as explained in Section 3.

Logo image data

Existing research literature has focused mostly on retrieval, detection, and recognition of a reduced number of logos [14, 17, 30, 32, 34, 42] and, consequently, a number of datasets were introduced. The most representative large public logo datasets are shown in Table 1. Due to the low diversity of the contained logos, these datasets are not suitable for learning and validating automatic logo generators. At the same time, a number of web pages allow (paid) access to a large number of icons, such as iconsdb.com (4135+ icons), icons8.com (59900+), iconfinder.com (7473+), iconarchive.com (450k+) and thenounproject.com (1m+). However, the diversity of these icons is limited by the number of sources, namely designers/artists, themes (categories) and design patterns (many are black-and-white icons). Therefore, we crawl a highly diverse dataset – the Large Logo Dataset (LLD) – of real logos 'in the wild' from the Internet. As shown in Table 1, our LLD provides thousands of times more distinct logos than the largest public logo dataset to date, WebLogo-2M [34].

In contrast to popularly used natural image datasets such as ImageNet [31], CIFAR-10 [21] and LSUN [41], face datasets like CelebA [23] and the relatively easily modeled handwritten digits of MNIST [22], logos are: (1) Artificial, yet strongly multimodal and thus challenging for generative models; (2) Applied, as there is an obvious real-world demand for synthetically generated, unique logos since they are expensive to produce; (3) Hard to label, as there are very few categorical properties which manifest themselves in a logo’s visual appearance. While the logos are easily obtainable in large quantities, they are specifically designed to be unique, which ensures the diversity of a large logo dataset. We argue that all these characteristics make logos a very attractive domain for machine learning research in general, and generative modeling in particular.

Generative models

Recent advances in generative modeling have provided viable frameworks for making such a system possible. The current state of the art is made up mainly of two types of generative models, namely Variational Autoencoders (VAEs) [16, 19, 20] and Generative Adversarial Networks (GANs) [2, 10, 11]. Both of these models generate their images from a high-dimensional latent space that can act as a sort of "design space" in which a user is able to modify the output in a structured way. VAEs have the advantage of directly providing embeddings of any given image in the latent space, allowing targeted modifications to its reconstruction, but tend to suffer from blurry output owing to the pixel-wise loss used during training. GANs, on the other hand, which consist of a separate generator and discriminator network trained simultaneously on opposing objectives in a competitive manner, are known to provide realistic-looking, crisp images but are notoriously unstable to train. To address this difficulty, a number of improvements to GAN architectures and training methods have been suggested [33], such as using deep convolutional layers [28] or modified loss functions, e.g. based on least squares or on the Wasserstein distance between probability distributions [3, 4, 12].

Conditional models

The first extension of GANs with class-conditional information [25] followed shortly after their inception, generating MNIST digits conditioned on class labels provided to both generator and discriminator during training. It has since been shown for supervised datasets that class-conditional variants of generative networks very often produce superior results compared to their unconditional counterparts [12, 15, 26]. By adding an encoder that maps a real image into the latent space, it was shown to be feasible to generate a modified version of the original image by changing class attributes on faces [6, 27] and other natural images [36]. Other notable applications include the generation of images from a high-level description such as visual attributes [39] or text descriptions [29].

Our contributions

In this work we train GANs on our own highly multi-modal logo data as a first step towards user-manipulated artificial logo synthesis. Our main contributions are:

  • LLD - a novel dataset of 600k+ logo images.

  • Methods to successfully train GAN models on multi-modal data. Our proposed clustered GAN training achieves state-of-the-art Inception scores on the CIFAR-10 dataset.

  • An exploration of GAN latent space for logo synthesis.

The remainder of this paper is structured as follows. We introduce a novel Large Logo Dataset (LLD) in Section 2. We describe the proposed clustered GAN training, the clustering methods, as well as the GAN architectures used and perform quantitative experiments in Section 3. Then we demonstrate logo synthesis by latent space exploration operations in Section 4. Finally, we draw the conclusions in Section 5.

Dataset Logos Images
FlickrLogos-27 [18] 27 1080
FlickrLogos-32 [30] 32 8240
BelgaLogos [17] 37 10000
LOGO-Net [14] 160 73414
WebLogo-2M [34] 194 1867177
LLD-icon (ours) 486377 486377
LLD-logo (ours) 122920 122920
LLD (ours) 486377+ 609297
Table 1: Logo datasets. Our LLD provides orders of magnitude more logos than the existing public datasets.

2 LLD: Large Logo Dataset

In the following we introduce a novel dataset based on website logos, called the Large Logo Dataset (LLD). It is the largest logo dataset to date (see Table 1). The LLD dataset consists of two parts: a low-resolution (32×32 pixel) favicon subset (LLD-icon) and a higher-resolution (400×400 pixel) Twitter subset (LLD-logo). In the following we briefly describe the acquisition, properties and possible use-cases for each. Both versions will be made available at https://data.vision.ee.ethz.ch/cvl/lld/.

2.1 LLD-icon: Favicons

Figure 3: Excerpt from LLD-icon.

For generative models like GANs, the difficulty of keeping the network stable during training increases with image resolution. Thus, when starting to work with a new type of data, it makes sense to start with a variant that is inherently low-resolution. Luckily, in the domain of logo images there is a category of such inherently low-resolution, low-complexity images: favicons, the small icons representing a website e.g. in browser tabs or favorite lists. We decided to crawl the web for such favicons using the largest resource of high-quality website URLs we could find: Alexa's top 1-million website list (now officially retired, formerly available at https://www.alexa.com). To this end we use the Python package Scrapy (https://scrapy.org/) in conjunction with our own download script, which directly converts all icons found to a standardized 32×32 pixel resolution and RGB color space, discarding all non-square images.

After acquiring the raw data from the web, we remove all exact duplicates (of which there is a surprisingly high number, almost 20%). Visual inspection of the raw data reveals a non-negligible number of images that do not comply with our initial dataset criteria and often are not even remotely logo-like, such as faces and other natural images. To remove this unwanted data, we (i) sort all images by PNG-compressed file size, an indicator of image complexity; (ii) manually inspect and partition the resulting sorted list into three sections: clean and mostly clean data, which are kept, and mostly unwanted data, which is discarded; (iii) discard the mostly clean images containing the least amount of white pixels.
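The deduplication and complexity-sorting steps above can be sketched in a few lines. The byte-string image format and function name are illustrative, and zlib-compressed length stands in here for the PNG file size the paper uses as its complexity indicator:

```python
import hashlib
import zlib

def deduplicate_and_rank(images):
    """Drop exact duplicates, then sort by compressed size.

    `images` is a list of raw image byte strings (a hypothetical
    format); the zlib-compressed length serves as a proxy for the
    PNG file size used as an image-complexity indicator.
    """
    seen, unique = set(), []
    for img in images:
        digest = hashlib.sha256(img).digest()
        if digest not in seen:  # keep only the first copy
            seen.add(digest)
            unique.append(img)
    # Simple, flat icons compress well and sort to the front; complex
    # natural images end up at the back for manual inspection.
    return sorted(unique, key=lambda img: len(zlib.compress(img)))
```

Partitioning the resulting sorted list into clean/mostly-clean/unwanted sections then only requires inspecting a handful of boundary regions rather than every image.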

The result of this process, a small sample of which is shown in Figure 3, is a clean set of 486,377 images of uniform 32×32 pixel size, making it very easy to use. The disadvantage of this standardized size is that 54% of the images appear blurry because they were scaled up from a lower resolution. For this reason we also provide (the indices for) a subset of the data containing only sharp images, which we refer to as LLD-icon-sharp.

2.2 LLD-logo: Twitter

Figure 4: Excerpt from LLD-logo, scaled down to 64×64 pixels.

For training generative networks at an increased resolution, additional high-resolution data is needed, which favicons cannot provide. One option would be to crawl the respective websites directly and look for the website or company logo. However, (a) it is not always straightforward to find the logo and distinguish it from other images on a website, and (b) the aspect ratio and resolution of logos obtained in this way vary widely, which would necessitate extensive cropping and resizing, potentially degrading the quality of a large portion of the logos.

By crawling Twitter instead of websites, we are able to acquire standardized square 400×400 pixel profile images which can easily be downloaded through the Twitter API without the need for web scraping. We use the Python wrapper tweepy to search for the (sub-)domain names contained in the Alexa list and match the original URL with the website provided in the Twitter profile to make sure that we have found the right Twitter user. The images are then run through a face detector to reject personal Twitter accounts, and the remaining images are saved together with Twitter metadata such as user name, number of followers and description. For this part of the dataset, all original resolutions are kept as-is: 80% of the images are at 400×400 pixels and the rest at some lower resolution (details are given in the supplementary material).

The acquired images are analyzed and sorted with a combination of automatic and manual processing in order to remove unwanted and possibly sensitive images, resulting in 122,920 usable high-resolution logos of consistent quality with rich metadata from the respective Twitter accounts. These logo images form the LLD-logo dataset, a small sample of which is presented in Figure 4.

3 Clustered GAN Training

We propose a method for stabilizing GAN training and achieving superior quality samples on unlabeled datasets by means of clustering (a) in the latent space of an autoencoder trained on the same data or (b) in the CNN feature space of a ResNet classifier trained on ImageNet. With both methods we are able to produce semantically meaningful clusters that improve GAN training.

In this section we review the GAN architectures used in our study, describe the clustering methods based on Autoencoder latent space and ResNet features, and discuss the quantitative experimental results.

3.1 GAN architectures

Our generative models are based on the Deep Convolutional Generative Adversarial Network (DCGAN) of Radford et al. [28] and the improved Wasserstein GAN with gradient penalty (iWGAN) proposed by Gulrajani et al. [12].


For our DCGAN experiments, we use Taehoon Kim's TensorFlow implementation (https://github.com/carpedm20/DCGAN-tensorflow). We train DCGAN exclusively on the low-resolution LLD-icon subset, for which it proved to be inherently unstable without our clustering approach. We use the input blurring explained in the next section in all our DCGAN experiments. For details on the hyper-parameters used, we refer the interested reader to the supplementary material.


All our iWGAN experiments are based on the official TensorFlow repository by Gulrajani et al. [12] (https://github.com/igul222/improved_wgan_training). We keep the default settings as provided by the authors and exclusively use the 32- and 64-pixel ResNet architectures from the repository, with the only major modification being our conditioning method as described below. We also use linear learning rate decay (from the initial value to zero over the full training run) for all our experiments.

3.2 Clustering

As mentioned in the introduction (Section 1), training a conditional GAN with labels improves output quality over an unsupervised setting. In particular, we found DCGAN to be unstable on our icon dataset (LLD-icon) at resolutions higher than 10×10, and were able to stabilize it by introducing synthetic labels as described in this section. In addition to stabilizing GAN training, we achieve state-of-the-art Inception scores (as proposed by Salimans et al. [33]) on CIFAR-10 using iWGAN with synthetic labels produced by RC clustering, and thus provide quantitative evidence of a quality improvement in Section 3.4. Furthermore, the cluster labels subsequently give us additional control over the generated logos: we can generate samples from individual clusters or transform a particular logo to inherit the specific attributes of another cluster, as demonstrated in Section 4.

AE: AutoEncoder Clustering

Our first proposed method for producing synthetic data labels is clustering in the latent space of an Autoencoder. We construct an Autoencoder consisting of a modified version of the GAN discriminator, with as many outputs as the latent space has dimensions instead of a single one, acting as an encoder to the latent space, and the unmodified GAN generator acting as a decoder reconstructing the image from the latent representation, as illustrated in Figure 5. This Autoencoder is trained using a simple pixel-wise reconstruction loss between the original and reconstructed image. All images are then encoded to latent vectors, followed by a PCA dimensionality reduction, and finally clustered using (mini-batch) k-means. For our logo data, this produces clusters that are both semantically meaningful, as they are based on high-level AE features, and recognizable by the GAN, because they were created using the same general network topology.

Figure 5: Autoencoder used for AE clustering. The generator G is equivalent to the one used in the GAN, while the encoder E consists of the GAN discriminator D with a higher number of outputs to match the dimensionality of the latent space z. It is trained using a simple pixel-wise reconstruction loss.
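The encode → PCA → k-means pipeline can be sketched as follows. The dimension, cluster count and deterministic farthest-point initialisation are illustrative choices, not the authors' settings (the paper uses mini-batch k-means for scale):

```python
import numpy as np

def cluster_latents(z, n_components=32, k=100, iters=20):
    """PCA-reduce encoded latent vectors, then run k-means.

    A sketch of the AE-clustering pipeline above; `z` holds one
    autoencoder latent vector per row.
    """
    # PCA via SVD on mean-centred latents
    zc = z - z.mean(axis=0)
    _, _, vt = np.linalg.svd(zc, full_matrices=False)
    x = zc @ vt[:n_components].T

    # Farthest-point initialisation keeps this sketch deterministic
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([((x - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers)

    for _ in range(iters):  # Lloyd iterations
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(0)
    return labels
```

The resulting integer labels are what the clustered GAN training consumes as synthetic classes.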

RC: ResNet Classifier Clustering

For our second clustering method we leverage the learned features of an ImageNet classifier, namely ResNet-50 by He et al. [13]. We feed our images to the classifier and extract the output of the final pooling layer to obtain a 2048-dimensional feature vector. After a PCA dimensionality reduction, we cluster our data in this feature space with (mini-batch) k-means. On CIFAR-10, the obtained clusters are considerably superior to those produced with our AE clustering method; one could argue that we benefit here from the similarity in categories between ImageNet and CIFAR-10, and are thus indirectly using labeled data. However, the clustering is also very meaningful on our logo dataset, which has very different content and does not consist of natural images like ImageNet, demonstrating the generality of this approach.

3.3 Conditional GAN Training Methods

In this section we describe the conditional GAN models used to leverage our synthetic data labels and the input blurring applied to DCGAN.

LC: Layer Conditional GAN

Figure 6: Layer Conditional Residual block as used in our iWGAN-LC. The label information is appended to the convolutional layer input in the same way as described in Figure 7. The skip connections remain unconditional.
Figure 7: Generator network as used for our layer conditional DCGAN (DCGAN-LC). The label y is appended as a one-hot vector to the latent vector. It is also projected onto a set of feature maps consisting of all zeros except for the map corresponding to the class number, where all elements have value one. These additional feature maps are then appended to the input of each convolutional layer.

In our layer-conditional models, the cluster label for each training sample is fed to all convolutional and linear layers of both generator and discriminator. For linear layers it is simply appended to the input as a one-hot vector. For convolutional layers, the labels are projected onto "one-hot feature maps" with as many channels as there are clusters, where the map corresponding to the cluster number is filled with ones while the rest are zero. These additional feature maps are appended to the input of every convolutional layer, so that every layer can directly access the label information. This is illustrated in Figure 7 for DCGAN and in Figure 6 for the ResNet used in our iWGAN model. Even though the labels are provided to every layer, there is no explicit mechanism forcing the network to use this information: if the labels are random or meaningless, they can simply be ignored. However, as soon as the discriminator starts adjusting its criteria for each cluster, it forces the generator to produce images that comply with the different requirements for each class. Our experiments confirm that visually meaningful clusters are always picked up by the model, while the network simply falls back to the unconditional state for random labels. This type of class conditioning has some useful properties, such as the ability to interpolate between different classes, and is less prone to failure in producing class-conditional samples than the AC conditioning described below. However, it adds a significant number of parameters, especially to low-resolution networks, when the number of classes is large. This effect diminishes with larger networks containing more feature maps, as the number of added parameters remains constant.
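The "one-hot feature map" construction can be sketched as a plain array operation (a numpy illustration of the idea; the paper's models build these maps inside TensorFlow graphs, and the NHWC layout here is an assumption):

```python
import numpy as np

def append_label_maps(feats, labels, n_classes):
    """Append one-hot feature maps to a conv-layer input (NHWC).

    For each sample, `n_classes` extra channels are added; the
    channel matching the cluster label is all ones, the rest are
    all zeros, mirroring the layer conditioning described above.
    """
    n, h, w, _ = feats.shape
    maps = np.zeros((n, h, w, n_classes), dtype=feats.dtype)
    maps[np.arange(n), :, :, labels] = 1.0  # fill the label's map
    return np.concatenate([feats, maps], axis=-1)
```

Because the label maps are spatially constant, the added cost is purely in the extra input channels of each convolution, which is why the relative parameter overhead shrinks as the network's own feature-map count grows.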

AC: Auxiliary Classifier GAN

With iWGAN we also use the Auxiliary Classifier proposed by Odena et al. [26] as implemented by Gulrajani et al. [12]. While this method does not allow us to interpolate between clusters and is thus slightly more limited from an application perspective, it avoids adding parameters to the convolutional layers, which in general results in a network with fewer parameters. iWGAN-AC was our method of choice for CIFAR-10, as it delivers the highest Inception scores.

Gaussian Blur

During our experiments we noticed that blurring the input images helps the network remain stable during training, which ultimately led us to apply a Gaussian blur to all images presented to the discriminator (training data as well as samples from the generator), as previously implemented by Susmelj et al. [35]. The method is schematically illustrated in Figure 8. Upscaling the images to 64×64 pixel resolution before convolving them with the Gaussian kernel enables us to train with blurred images while preserving almost all of the image's sharpness when scaled back down to the original resolution of 32×32 pixels. When generating samples from the trained generator without applying the blur filter, there is some noticeable noise in the images, which becomes imperceptible after resizing to the original data resolution, yielding almost perfectly sharp output images. Based on our experimental experience we believe this blurring to produce higher-quality samples and to help stability; it is, however, not strictly necessary for stability with DCGAN when using clustered training.

Figure 8: Generative Adversarial Net with blurred Discriminator input. Both original and generated images are blurred using a Gaussian filter of fixed strength.

3.4 Quantitative evaluation and state-of-the-art

In order to quantitatively assess the performance of our solutions on the commonly used CIFAR-10 dataset, we report Inception scores [33] and diversity scores based on MS-SSIM [37], as suggested in [26], over a set of 50,000 randomly generated images. In Table 2 we summarize results for different configurations in supervised (using CIFAR class labels) and unsupervised settings in LC and AC conditional modes, including reported scores from the literature.

Method | Clusters | Inception score | Diversity (MS-SSIM)
Unsupervised:
Infusion training [5] | – | 4.62±0.06 | –
ALI [9] (from [38]) | – | 5.34±0.05 | –
Impr. GAN (-L+HA) [33] | – | 6.86±0.06 | –
EGAN-Ent-VI [7] | – | 7.07±0.10 | –
DFM [38] | – | 7.72±0.13 | –
iWGAN [12] | – | 7.86±0.07 | –
iWGAN | – | 7.853±0.072 | 0.0504±0.0017
iWGAN-LC with AE clustering | 32 | 7.300±0.072 | 0.0507±0.0016
iWGAN-LC with RC clustering | 32 | 7.831±0.072 | 0.0491±0.0015
iWGAN-LC with RC clustering | 128 | 7.799±0.030 | 0.0491±0.0015
iWGAN-AC with AE clustering | 32 | 7.885±0.083 | 0.0504±0.0014
iWGAN-AC with RC clustering | 10 | 8.433±0.068 | 0.0505±0.0016
iWGAN-AC with RC clustering | 32 | 8.673±0.075 | 0.0500±0.0016
iWGAN-AC with RC clustering | 128 | 8.625±0.109 | 0.0465±0.0015
Supervised (using CIFAR-10 labels):
iWGAN-LC | | 7.710±0.084 | 0.0510±0.0013
Impr. GAN [33] | | 8.09±0.07 | –
iWGAN-AC [12] | | 8.42±0.10 | –
iWGAN-AC | | 8.35±0.07 | 0.049±0.0018
AC-GAN [26] | | 8.25±0.07 | –
SGAN [15] | | 8.59±0.12 | –
CIFAR-10 (original data) | – | 11.237±0.116 | 0.0485±0.0016
Table 2: Comparison of Inception and diversity scores (lower score = higher diversity) on CIFAR-10. The unsupervised methods do not use the CIFAR-10 class labels. Note that our unsupervised methods achieve state-of-the-art performance comparable to the best supervised approaches.
Method | Clusters | CORNIA score | Diversity (MS-SSIM)
DCGAN-LC with AE clustering | 100 | 62.12±0.51 | 0.0475±0.0013
iWGAN-LC with AE clustering | 100 | 60.24±0.61 | 0.0439±0.0010
*iWGAN | – | 54.27±0.67 | 0.0488±0.0011
*iWGAN-LC with RC clustering | 16 | 55.37±0.67 | 0.0490±0.0014
*iWGAN-LC with RC clustering | 128 | 55.27±0.68 | 0.0484±0.0010
LLD-icon (original data) | – | 61.00±0.62 | 0.0482±0.0014
*LLD-icon-sharp (original data) | – | 55.37±0.67 | 0.0494±0.0011
Table 3: CORNIA scores and diversity scores for models trained on LLD-icon. The starred (*) models were trained on the subset LLD-icon-sharp. Lower values mean higher quality for CORNIA and higher diversity for MS-SSIM.


On CIFAR-10, increasing the number of RC clusters from 1 to 128 improves the diversity scores for iWGAN-AC, while the Inception score peaks at 32 clusters. We note that RC clustering leads to better performance than AE clustering.

Performance and state-of-the-art

Our best Inception score of 8.67, achieved with iWGAN-AC and 32 RC clusters, is significantly higher than the score reported by Salimans et al. [33] for their Improved GAN method, the best score reported in the literature for unsupervised methods. Surprisingly, this result, achieved with unsupervised synthetic labels provided by RC clustering, is comparable to the 8.59 of the Stacked GANs approach by Huang et al. [15], the best score reported for supervised methods.
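For reference, the Inception score [33] underlying these comparisons can be computed from class posteriors as follows; in practice the posteriors come from a pretrained Inception network evaluated on the 50,000 generated samples, which is assumed given here:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from class posteriors p(y|x), one row per
    generated image: exp(E_x[ KL(p(y|x) || p(y)) ]).

    High scores require each image to be confidently classified
    (low-entropy rows) while the marginal p(y) stays broad.
    """
    p_y = probs.mean(axis=0)  # marginal class distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

With perfectly confident and perfectly balanced posteriors over k classes the score is exactly k, which is why it is bounded by the number of classes the evaluation network distinguishes.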

Image quality

Complementary to the Inception and diversity scores, we also measured image quality using CORNIA, a robust no-reference image quality assessment method proposed by Ye and Doermann [40]. On both CIFAR-10 and LLD-icon, our generative models obtain CORNIA scores equivalent to those of the original images from each dataset. This result is in line with the findings in [35], where the studied GANs also converge in terms of CORNIA scores towards the image quality of the data. We show the CORNIA and MS-SSIM scores for the LLD-icon dataset, complementing the Inception scores on CIFAR-10, in Table 3.

LC vs. AC for conditional GANs

Our AC-GAN variants are better than their LC counterparts in terms of Inception scores, but comparable in terms of diversity on CIFAR-10. We believe this is owed to the fact that AC-GAN enforces the generation of images which can easily be classified into the provided clusters, which in turn could raise the classifier-based Inception score. Even though the numbers indicate a qualitative advantage of AC- over LC-GAN, we prefer the latter for our logo application as it allows smooth interpolation even between different clusters. This is not possible in the standard AC-GAN implementation, since the cluster labels are discrete integer values; all our desired latent space operations would thus be constrained to a single data cluster, which does not match our intended use.

4 Logo synthesis by latent space exploration

As mentioned in the previous section, layer conditioning allows for smooth transitions in the latent space from one class to another, which is critical for logo synthesis and manipulation by exploration of the latent space. Therefore, we work with two configurations for these experiments: iWGAN-LC with 128 RC clusters and DCGAN-LC with 100 AE clusters. Their Inception, diversity and CORNIA scores are comparable on the LLD-icon dataset.

4.1 Sampling

In generative models like GANs [11] and VAEs [20], images are generated from a high-dimensional latent vector (usually with somewhere between 50 and 1,000 dimensions), also commonly referred to as the z-vector. During training, each component of this vector is randomly sampled from a uniform or Gaussian distribution, so that the generator learns to produce a reasonable output for any random vector sampled from the same distribution. The space spanned by these latent vectors, called the latent space, is often highly structured, such that latent vectors can be deliberately manipulated in order to achieve certain properties in the output [6, 8, 28].

Figure 9: The first four (random) clusters of LLD-icon as attained with our AE clustering method using 100 cluster centers. The top half of each example contains a random selection of original images, while the bottom half consists of samples generated by DCGAN-LC for the corresponding cluster. The very strong visual correspondence demonstrates the network's ability to capture the data distributions inherent in the classes produced by our clustering method.

Using DCGAN-LC with 100 AE clusters on the same data, Figure 9 contains samples from a specific cluster next to a sample of the respective original data. This shows how the layer conditional DCGAN is able to pick up on the data distribution and produce samples which are very easy to attribute to the corresponding cluster and are often hard to distinguish from the originals at first glance. For comparison we also show results for iWGAN-LC with 128 RC clusters trained on the LLD-icon-sharp dataset in Figure 1.

4.2 Interpolations

To show that a generator does not simply learn to reproduce samples from the training set, but is in fact able to produce smooth variations of its output images, it is common practice [10] to perform interpolations between two points in the latent space and to show that the outcome is a smooth transition between the two corresponding generated images, with all intermediate images exhibiting the same distribution and quality. Interpolation also provides an effective tool for a logo generator application, as the output image can be manipulated in a controlled manner towards a certain (semantically meaningful) direction in latent space.

Figure 10: Interpolation between 4 selected logos of distinct classes using DCGAN-LC with 100 AE clusters on LLD-icon, showcasing smooth transitions and interesting intermediate samples in-between all of them.

For all our interpolation experiments we use the distribution matching methods from [1] in order to preserve the prior distribution the sampled model was trained on. An example with 64 interpolation steps, showcasing the smoothness of such an interpolation, is given in Figure 10, where we interpolate between 4 sample points and obtain believable logos at every step. As this example shows, the interpolation works very well even between logos of different clusters, even though the generator was never trained on mixed cluster attributes.

Figure 11: Continuous interpolation between 5 random points each within one cluster (top) and in-between distinct clusters (bottom) in latent space using iWGAN-LC with 128 RC clusters on icon-sharp. We observe smooth transitions and logo-like samples in all of the sampled subspace.

Some more interpolations between different logos both within a single cluster and between logos of different clusters are shown in Figure 11, this time between 2 endpoints and with only 8 interpolation steps.
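A latent-space interpolation of this kind can be sketched as follows. Note this uses spherical interpolation (slerp) as a simple stand-in, not the distribution-matching operators of [1] that the paper actually employs; like them, it avoids the low-norm midpoints that plain linear interpolation produces under a Gaussian prior:

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent vectors."""
    omega = np.arccos(np.clip(
        np.dot(z0, z1) / (np.linalg.norm(z0) * np.linalg.norm(z1)), -1, 1))
    if np.isclose(omega, 0.0):
        return (1 - t) * z0 + t * z1  # parallel vectors: fall back to lerp
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

def interpolation_path(z0, z1, steps=8):
    # Latent codes along the path; each would be decoded by the generator
    return [slerp(z0, z1, t) for t in np.linspace(0.0, 1.0, steps)]
```

For a layer-conditional model, the one-hot class vectors can be interpolated alongside the z-vectors to realize the cross-cluster transitions shown in Figure 11.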

4.3 Class transfer

As the one-hot class vector representing the logo cluster is separate from the latent vector, it is also possible to keep the latent representation constant and only change the cluster of a generated logo. Figure 12 contains 11 logos (top row) that are transformed to a particular cluster class in each subsequent row. This shows how general appearance attributes such as color and content are encoded in the z-vector, while the cluster label transforms these attributes into a form that conforms with the contents of the respective cluster. Here, again, interpolation could be used to create intermediate versions as desired.

Figure 12: Logo class transfer using DCGAN-LC on LLD-icon with 100 AE clusters. The logos of the 1st row get transferred to the class (cluster) of the logos in the 1st column (to the left). The latent vector is kept constant within each column and the class label is kept constant within each row (except for the 1st of each, respectively). The original samples have been hand-picked for illustrative purposes.

4.4 Vicinity sampling

Figure 13: Vicinity Sampling using iWGAN-LC on LLD-icon-sharp with 128 RC clusters.

Another powerful tool to explore the latent space is vicinity sampling, where we perturb a given sample in random directions of the latent space. This could be useful to present the user of a logo generator application with a choice of possible variants, allowing them to modify their logo step by step in directions of their choice. In Figure 13 we present an example of a 2-step vicinity sampling process, where we interpolate one-third of the way towards random samples to produce a succession of logo variants.
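The perturbation step can be sketched as follows; this is a minimal numpy illustration, assuming a standard Gaussian prior, with the fraction `alpha` set to the one-third used in our example.

```python
import numpy as np

def vicinity_samples(z, num_variants, alpha=1.0 / 3.0, rng=None):
    """Perturb latent point z by moving a fraction `alpha` of the way
    towards random draws from the prior, yielding nearby logo variants."""
    if rng is None:
        rng = np.random.default_rng()
    targets = rng.standard_normal((num_variants, z.shape[0]))
    return (1.0 - alpha) * z + alpha * targets
```

Iterating this on a chosen variant gives the multi-step refinement process shown in Figure 13.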

4.5 Vector arithmetic example: Sharpening

For models trained on our LLD-icon data, some of the generated icons are blurry, since roughly half of the logos in this dataset are upscaled from a lower resolution. However, by averaging the z-vectors of a number of sharp samples and subtracting from this the average of a number of blurry samples, it is possible to construct a “sharpening” vector which can be added to blurry logos to transform them into sharp ones. This works very well even if the directional vector is calculated exclusively from samples of one cluster and then applied to samples of another, showing that blurriness is in fact nothing more than a feature embedded in latent space. The result of such a transformation is shown in Figure 14, where the sharpening vector was calculated from 40 sharp and 42 blurry samples manually selected from two random batches of the same cluster. The resulting vector is then applied equally to all blurry samples. The quality of the result, while already visually convincing, could be further optimized by adding an individually adjusted fraction of the sharpening vector to each logo.
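The underlying vector arithmetic is straightforward; the sketch below assumes the latent codes of the hand-selected sharp and blurry samples are available as arrays.

```python
import numpy as np

def direction_vector(z_target, z_source):
    """Latent direction from the source attribute towards the target,
    e.g. mean(sharp latents) - mean(blurry latents)."""
    return np.mean(z_target, axis=0) - np.mean(z_source, axis=0)

def apply_direction(z, direction, strength=1.0):
    """Shift a latent code along the attribute direction; `strength`
    can be tuned per logo for the best visual result."""
    return z + strength * direction
```

The same two functions cover the other attribute manipulations discussed in the supplementary material, with different sample selections defining the direction.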

This example of adding a sharpening vector to the latent representation is only one of many latent space operations one could think of, such as directed manipulation of form and color as performed in the supplementary material.

(a) Original samples
(b) Sharpened samples
Figure 14: Sharpening of logos in the latent space by adding an offset calculated from the latent vectors of sharp and blurry samples. We used DCGAN-LC and 100 AE clusters.

5 Conclusions

In this paper we tackled the problem of logo design by synthesis and manipulation with generative models:

  • We introduced a Large Logo Dataset (LLD) crawled from the Internet, with orders of magnitude more logos than existing datasets.

  • In order to cope with the high multi-modality and to stabilize GAN training on such data, we proposed clustered GANs, i.e. GANs conditioned on synthetic labels obtained through clustering. We performed clustering in the latent space of an Autoencoder or in the CNN feature space of a ResNet classifier, and conditioned DCGAN and improved WGAN using either an Auxiliary Classifier or a Layer Conditional model.

  • We quantitatively validated our clustered GAN approaches on the CIFAR-10 benchmark, where we set a clear state-of-the-art Inception score for unsupervised generative models, showcasing the benefits of meaningful synthetic labels obtained through clustering in the CNN feature space of an ImageNet classifier.

  • We showed that the latent space of the networks trained on our logo data is smooth and highly structured, thus having interesting properties exploitable by performing vector arithmetic in that space.

  • We showed that the synthesis and manipulation of (virtually) infinitely many variations of logos is possible through latent space exploration equipped with a number of operations such as interpolations, sampling, class transfer or vector arithmetic in latent space like our sharpening example.

Our solutions ease the logo design task in an interactive manner and are significant steps towards a fully automatic logo design system.

For more results, operations, and settings the reader is invited to consult the supplementary material.


Supplementary Material

The following pages contain the supplementary material for this paper.

After presenting some latent space exploration experiments with LLD-logo in Section A, we give some additional details on the data collection process as well as the final contents of our LLD datasets in Section B. We then proceed to show in Section C, for each subset of our Large Logo Dataset, an excerpt of the collected data together with generated samples from selected GAN architectures and the clusters produced by the applied clustering methods. For CIFAR-10 we also show samples from our cluster-conditional models together with samples from the unconditional and supervised iWGAN variants in this section. Finally, we give some details on the architecture and training hyper-parameters of our models in Section D.

Appendix A Latent space exploration on LLD-logo

In this section, we present some interpolations on the LLD-logo dataset and perform two additional experiments with latent space operations.


In Figure 15 we present two examples of interpolations between 4 different samples, representing a small section of the high-dimensional logo manifold created by the GAN.

Vector arithmetic

First, we define two desirable semantic operations: (1) color shifts from red to blue and blue to red, and (2) shape changes from square to round and round to square. For each of these operations we identify a number of samples (around 30 in our experiments) that match our criteria; for operation (1), this means selecting 30 red and 30 blue logos. We then construct a directional vector by subtracting the mean latent space vector of all red logos from the mean latent space vector of all blue logos, which gives us a direction from red to blue. Since some of these semantic attributes are expected to be encoded in the cluster labels as well, we can do the same with our one-hot encoded class vectors, which we can view as an additional cluster space.

In Figure 16 we add this directional vector to a new random batch of generated logos. If we subtract the directional vector, we get a shift in the opposite direction, i.e. from blue to red. To find out how much of the color information is encoded in the latent representation and in the clusters respectively, we can perform the operation in only one of these domains. This is done in Figure 17 for the red-shift, where we observe very similar behavior for both spaces, indicating that the color information is encoded equally in both latent space and labels.
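The combined shift in latent and label space can be sketched as below; the red/blue example sets are assumed to be hand-selected latents (`z_*`) and their one-hot cluster labels (`y_*`), mirroring the experiment above.

```python
import numpy as np

def semantic_shift(z_batch, y_batch, z_red, z_blue, y_red, y_blue, strength=1.0):
    """Shift a batch towards 'blue' in both latent and cluster-label space.

    Directions are differences of means over hand-selected red/blue
    examples; use a negative `strength` to shift towards red instead,
    or apply only one of the two updates to probe a single space.
    """
    dz = np.mean(z_blue, axis=0) - np.mean(z_red, axis=0)  # latent direction
    dy = np.mean(y_blue, axis=0) - np.mean(y_red, axis=0)  # label direction
    return z_batch + strength * dz, y_batch + strength * dy
```

Passing the shifted batch through the conditional generator produces the recolored logos of Figures 16 and 17.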

Our second experiment is performed in the same way, and the directional vector is applied to the same batch of samples. Figure 18, again, shows the result for a simultaneous addition of both (latent and class) vectors in each direction, whereas each space is considered individually in Figure 19 for the directional vectors towards round logos. Here we can observe that some logos respond better to the change in latent space, while others seem more responsive to a changing cluster label. Overall, the label information seems to be a little stronger in this case.

In both experiments, the combined shift clearly performs best and could provide a powerful tool for logo manipulation and other applications.

(a) Interpolation between 4 square logos.
(b) Interpolation between logos of different shape.
Figure 15: Four-point interpolation on LLD-logo.
(a) Random Sample, unmodified.
(b) Samples from (a) shifted towards blue logos
(c) Samples from (a) shifted towards red logos
Figure 16: Blue-red shift on a random batch. Directional vectors are both applied in latent space and in cluster label space.
(a) Samples from Figure 16(a) shifted towards red logos only in latent vector space
(b) Samples from Figure 16(a) shifted towards red logos only in label vector space
Figure 17: Blue-red shift on a random batch performed in either latent representation or cluster labels.
(a) Random sample, unmodified (same as Figure 16(a))
(b) Samples from (a) shifted towards round logos
(c) Samples from (a) shifted towards square logos
Figure 18: Round-square shape shift on a random batch. Directional vectors are both applied in latent space and in cluster label space.
(a) Samples from Figure 18(a) shifted towards round logos only in latent vector space
(b) Samples from Figure 18(a) shifted towards round logos only in label vector space
Figure 19: Round-square shape shift on a random batch performed in either latent representation or cluster labels.

Appendix B LLD crawling and image statistics

b.1 LLD-icon

When collecting the favicons for LLD-icon, our download script directly converted all icons found to a standardized 32×32 pixel resolution and RGB color space, discarding all non-square images. After acquiring the raw data from the web, we remove all exact duplicates and perform a three-stage clean-up process:

  1. Sort all images by complexity, evaluated as their PNG-compressed file size.

  2. Manually inspect and partition the resulting sorted list into three sections: clean, mostly clean and mostly unwanted data. The last section is discarded, while the middle part (mostly clean) is further processed in the next step.

  3. Sort the intermediate section according to the number of white pixels in each image and cut off at a suitable point after inspection, discarding the images containing the least white pixels.
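The two sorting heuristics can be sketched as follows. This is an illustrative numpy version, with `zlib` standing in for PNG's DEFLATE stage as the complexity proxy and an assumed near-white threshold; the exact cut-off points were chosen by manual inspection.

```python
import zlib
import numpy as np

def png_complexity(image):
    """Proxy for visual complexity: size of the losslessly compressed
    pixel data. Simple flat icons compress well; photos do not."""
    return len(zlib.compress(np.asarray(image, dtype=np.uint8).tobytes()))

def white_pixel_count(image, threshold=250):
    """Number of (near-)white pixels in an RGB image array."""
    img = np.asarray(image)
    return int(np.all(img >= threshold, axis=-1).sum())

def sort_by_complexity(images):
    """Indices in ascending complexity: simple icons first, noisy images last."""
    return sorted(range(len(images)), key=lambda i: png_complexity(images[i]))
```

Logo-like icons tend to have large uniform (often white) backgrounds, which motivates the white-pixel criterion in the final stage.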

Table 4 shows statistics on the crawling process, the original image resolutions the icons were rescaled from, and numbers on content removed through our clean-up process.

b.2 LLD-logo

During the collection of LLD-logo on Twitter, we use a face detector to recognize faces and proceed to the next user in the search results whenever a face is detected. At the same time, we make use of Twitter's (relatively new) sensitive-content flag to reject flagged profiles. As the number of rejected profiles in Table 5, compared to the number of images discarded during cleanup (of which a substantial number were due to sensitive content), shows, this flag is currently used only sporadically and is far from a reliable indicator. Figure 20 shows a histogram of image resolutions contained in LLD-logo (where no re-scaling was performed during data collection), with the top-5 image resolutions (amounting to 92% of images) given in Table 6.
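The two filters combine into a simple accept/reject decision per profile. The sketch below is illustrative: `detect_faces` is a pluggable detector (e.g. an OpenCV Haar cascade) assumed to return the number of faces found, and `possibly_sensitive` is assumed to be the name of the sensitive-content flag in the profile metadata.

```python
def accept_profile(profile, detect_faces, flagged_key="possibly_sensitive"):
    """Decide whether to keep a profile image, mirroring the two filters
    described above: skip flagged profiles, then skip any image in which
    the face detector fires."""
    if profile.get(flagged_key, False):
        return False  # rejected via the sensitive-content flag
    return detect_faces(profile["image"]) == 0
```

Since the flag alone proved unreliable, the subsequent manual cleanup stage remains necessary.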

Figure 20: Histogram of image sizes in LLD-logo. There are a total of 329 different image resolutions contained in the dataset.
Table 4: Crawling statistics for LLD-icon: failed requests, unreadable files, non-square images, images unable to process, and total images saved; image re-scaling (native 32 px, scaled up, scaled down); dataset cleanup (duplicates removed, discarded due to content, clean dataset size).
Table 5: Crawling and clean-up statistics for LLD-logo: flagged content ignored, downloaded images, images discarded during cleanup, and final dataset size.
Table 6: The 5 most prominent image resolutions in LLD-logo (image height in px, number of images, and percentage of total), covering 92.3% of the contained images.

Appendix C Logo Data, clusters and generated samples

In this section, we will show a small sample from each of our introduced datasets and present generated icons from models trained on said dataset. Additionally, we show the data clusters produced by our clustering methods.

Starting with LLD-logo, Figure 21 shows a sample of the original data collected (reduced to 64×64 pixels) next to the logos generated by an iWGAN model trained at 64×64 pixels. Compared to LLD-icon, these logos contain a lot more text and sometimes more detailed images. Both of these features are recreated nicely by the model; the text is often (but not always) illegible while still of a realistic appearance. We expect the legibility of the text would be much higher if our data did not contain a lot of non-Latin (e.g. Chinese) characters. Figure 22 contains the 64 clusters found with our RC method, showing very obvious semantic similarities within each cluster. It is not immediately noticeable that each block is composed of real (top half) and generated (bottom half) samples, which shows how well the GAN is able to reproduce the specific distributions inherent in each cluster.

In a similar way, Figures 23 and 24 present samples from LLD-icon and LLD-icon-sharp, respectively. Here we compare random samples from different trained models, including both conditional and unconditional variants. Figures 25 and 26 show the clusters found in LLD-icon by clustering in the latent space of an Autoencoder, while Figures 27 and 28 show the clusters found in LLD-icon-sharp in the feature space of a ResNet classifier. A very noticeable difference originates from the fact that the Autoencoder was trained on gray-scale images and is thus relatively color-independent, while there are some very apparent single-color clusters in the RC version, mostly containing green, blue or orange/red logos.

Finally, in Figure 29, we present some samples from our benchmarked CIFAR-10 Generators, together with the achieved inception score. Figures 30 and 31 compare the clusters found using our RC method with the original data labels, with noticeably more visually uniform classes using our synthetic labeling technique.

(a) Original data
(b) iWGAN-LC with 64 RC clusters
Figure 21: Random samples from LLD-logo data and a trained iWGAN model using 64 RC clusters and a 64×64 pixel output resolution.
Figure 22: All 64 clusters of LLD-logo clustered with a ResNet classifier for 64 cluster centers. The top half of each block contains 9 random samples of original images from the cluster, while the bottom half contains 9 random samples from the iWGAN-LC Generator trained at 64×64 pixels. Best viewed as PDF at 400% magnification.
(a) Original data
(b) DCGAN-LC with 100 AE clusters
(c) iWGAN-LC with 100 AE clusters
(d) iWGAN-LC with 128 RC Clusters
Figure 23: Random samples from LLD-icon and generative models trained on this data.
(a) Original data
(b) Unconditional iWGAN
(c) iWGAN-LC with 16 RC clusters
(d) iWGAN-LC with 128 RC Clusters
Figure 24: Random samples from LLD-icon-sharp and generative models trained on this data.
Figure 25: Clusters 1-70 of LLD-icon clustered in the latent space of an Autoencoder with 100 cluster centers. The top half of each block contains 9 random samples of original images from the cluster, while the bottom half contains 9 random samples from the DCGAN-LC Generator.
Figure 26: Clusters 71-100 of LLD-icon clustered in the latent space of an Autoencoder with 100 cluster centers. The top half of each block contains 9 random samples of original images from the cluster, while the bottom half contains 9 random samples from the DCGAN-LC Generator.
Figure 27: Clusters 1-70 of LLD-icon-sharp clustered with a ResNet Classifier and 128 cluster centers. The top half of each block contains 9 random samples of original images from the cluster, while the bottom half contains 9 random samples from the iWGAN-LC Generator.
Figure 28: Clusters 71-128 of LLD-icon-sharp clustered with a ResNet Classifier and 128 cluster centers. The top half of each block contains 9 random samples of original images from the cluster, while the bottom half contains 9 random samples from the iWGAN-LC Generator.
(a) iWGAN unconditional. Inception score: 7.85
(b) iWGAN-AC with 32 RC clusters. Inception score: 8.67
(c) iWGAN-AC with original labels. Inception score: 8.35
(d) iWGAN-LC with 32 RC clusters. Inception score: 7.83
Figure 29: Random samples from different iWGAN models trained on CIFAR-10 data.
(a) Original data labels (10 categories)
(b) Clustering in Autoencoder space with 32 cluster centers
Figure 30: Original labels and 32 AE clusters. Note the strong variability in visual appearance within the semantic classes, pointing to a possible advantage of using a clustering more in-line with visual semantics. Our experiments with AE clustering produced clearly inferior results on the CIFAR-10 dataset (as compared to our own LLD data).
(a) Clustering in the CNN feature space of a ResNet classifier with 10 cluster centers
(b) Clustering in the CNN feature space of a ResNet classifier with 32 cluster centers
Figure 31: Resulting clusters using RC clustering with 10 and 32 cluster centers. Compared to the original labels in Figure 30, the 10 clusters shown here are more uniform in visual appearance, however increasing the number of clusters to 32 gives each of them an even more visually consistent appearance.

Appendix D Architecture Details

In this section we specify the exact architectures and hyper-parameters used to train our models.

iWGAN for 32×32-pixel output

We use the residual network architecture designed for CIFAR-10 described in [12] (Appendix C) for this model. For iWGAN-LC, each stage has an input shape of [n + k, …], where n is the input depth of the original model and k is the number of classes, i.e. the number of cluster centers used in our clustering approach. All training hyper-parameters remain untouched, and we never use normalization in the Discriminator, as this resulted in consistently superior Inception scores in our CIFAR-10 experiments. We use the exact same model and training parameters with our LLD-icon dataset.
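The layer-conditional input extension can be sketched as a channel-wise concatenation of the one-hot label to each stage's feature map; the numpy version below (channels-last layout assumed) illustrates the shape bookkeeping.

```python
import numpy as np

def layer_condition(features, y):
    """Concatenate a one-hot class vector to a feature map as constant
    channels, as done at the input of each stage in the LC models.

    features: (H, W, n) feature map; y: (k,) one-hot label.
    Returns an (H, W, n + k) tensor.
    """
    h, w, _ = features.shape
    label_maps = np.broadcast_to(y, (h, w, y.shape[0]))
    return np.concatenate([features, label_maps], axis=-1)
```

Each convolutional stage then operates on the n + k channels, letting the class label influence every resolution level of the network.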

iWGAN for 64×64-pixel output

For LLD-logo at 64×64 pixels we again use the official TensorFlow implementation by Gulrajani et al. [12] (https://github.com/igul222/improved_wgan_training). Again, the input of each stage is extended to a shape of [n + k, …], where n is the size in the original model and k is the number of classes. The only change we made here is to use fewer iterations and to linearly decay the learning rate over these iterations.


For DCGAN, we deviate from some hyperparameters used in Taehoon Kim’s TensorFlow implementation (https://github.com/carpedm20/DCGAN-tensorflow), namely:

  • Higher number of feature maps: (128+k, 256+k, 512+k, 1024+k) for the Discriminator layers and (256+k, 512+k, 1024+k, 2048+k) for the Generator layers, with k again being the number of classes in the LC version.

  • For each training iteration of the Discriminator, we train the Generator 3 times

  • Reduced learning rate of 0.0004 (default: 0.002)

  • Higher latent space dimensionality of 512 components (default: 100)

  • Blur input images to Discriminator as detailed in Section 3.3 of our paper.
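For reference, the deviations listed above can be collected into a single configuration sketch; the key names below are illustrative, not those of the referenced implementation.

```python
# Hypothetical config dict summarizing our DCGAN deviations; in the LC
# version, k (the number of classes) is added to each feature-map count.
DCGAN_OVERRIDES = {
    "d_feature_maps": [128, 256, 512, 1024],   # + k in the LC version
    "g_feature_maps": [256, 512, 1024, 2048],  # + k in the LC version
    "g_steps_per_d_step": 3,                   # train G 3x per D iteration
    "learning_rate": 0.0004,                   # default: 0.002
    "z_dim": 512,                              # default: 100
    "blur_discriminator_input": True,          # see Section 3.3
}
```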