gan are a neural network architecture whose main purpose is generating previously unseen instances of data following a previously learned distribution. For example, they can be used to generate new handwritten digits or faces of people who do not exist (goodfellow_generative_2014).
This works by having two separate neural networks, the generator and the discriminator. The generator’s task is to create realistic samples and the discriminator’s task is to determine whether a given sample is real or a fake from the generator. The input to the generator is a vector of noise, and its output a data sample. The discriminator’s input is a data sample and its output is a decision whether the sample if real or fake. Over the course of training a gan, the generator learns to create more realistic samples while the discriminator gets better at distinguishing real and fake instances.
cgan (mirza_conditional_2014) were introduced recently after gan. Their modification is an additional label input for both the generator and the discriminator. For example, for learning to generate digits, the labels would be digits from 0 to 9. For example, if one gave the label “2” to the generator it would try to generate a digit that resembles “2”. The discriminator then would also get the label “2” so that it knows that the sample it is looking at should be “2” no matter if it is real or fake.
cgan can be used for tagging images by utilizing the discriminator (mirza_conditional_2014). It is also possible to use the labels in the generator to transition between different cases (gauthier_conditional_2015), like for example transitioning from “1” to “7”.
Commonly, cgan are combined with cnn to be able to process image data (radford_unsupervised_2015). For the generator this is done by concatenating the label with the noise vector that is fed into the generator. For the discriminator the image is processed by a regular cnn. After the cnn there is one (or more than one) fully connected layer, which takes as its input the output of the cnn as well as the label. Finally the discriminator outputs its decision regarding the realness of the sample.
We present one new dataset for gan: The dataset consists of thousands of images of facades of buildings in various cities. Each image also has a label, which indicates the city of the building depicted.
We show that both regular Convolutional gan as well as Convolutional cgan have significant shortcomings for our use case. We then propose a new custom architecture, which delivers higher-quality results as well as more stable convergence for our datasets.
In summary, we make the following contributions:
a new neural network architecture for Convolutional cgan
two datasets of facades
insight about the architectural features of the cities under study
program code that allows practitioners to reproduce our experiments and generate similar datasets
All our code and the dataset we used are available at https://github.com/muxamilian/city-gan.
We implemented City-GAN, a Generative Adversarial Network (goodfellow_generative_2014) that can generate facades of buildings in cities it was trained on.
To achieve this, we followed these steps:
We chose four cities for our analysis: Vienna, Paris, Amsterdam and Manhattan.
We created a list of valid addresses in these cities.
We downloaded images from Google Street View from a randomly sampled subset of these addresses in each city. We chose the altitude angle of the camera to be 20 degrees.
We filtered images which fulfilled the following criteria:
It is possible to find a way from the left to the right edge of the image that only goes over facades and has no gaps in between.
The building does not show anomalous structures. For example, we consider construction sites as anomalous because they do not represent the typical cityscape.
Figure 1. Two examples of pictures that we did not include in our dataset: on the left image there are gaps in the facades, the right one is anomalous (not a facade).
After filtering we had 1000 images per city (4000 in total).
We experimented with different types of GANs and different architectures.
We used another, larger dataset (zamir_repo_2019) and checked whether a significantly higher number of samples improves the learning performance. This dataset consists of 500,000 facades each for Amsterdam, Washington DC, Florence, Las Vegas and Manhattan. These images are, however, not curated. The only restriction we made is that the altitude angle must be between 0 and 40 degrees. Otherwise, these images are randomly sampled from the cities with no further restrictions.
We tried different GAN architectures during the course of our experimentation. For all our experiments we used the following data augmentation methods: horizontal flipping with probability 0.5 as well as random cropping of approximately 85% of the original width and height of the images. We experimented with different cropped image sizes: 64, 128 and 256 pixels in height and width. For a final image of 256 pixels, we would input an image of 300 pixels and randomly crop it to 256.
3. Regular GAN
First, we used a regular deep convolutional GAN (radford_unsupervised_2015)
and didn’t distinguish between the different cities. The code we used we got from the official PyTorch examples(pytorch_deep_2019) which implements the procedure outlined by radford_unsupervised_2015. However, from visually inspecting the results, we concluded that the GAN has a hard time learning the data distribution.
The generator starts with a vector of multivariate gaussian noise (length 100) and continuously deconvolutes it (kernel size 4, stride 2) until it outputs the image of the final size.
The discriminator gets the image and continuously convolutes it (kernel size 4, stride 2) until it is of dimension
. Then there is one linear layer followed by a sigmoid function that outputs whether the sample is real or fake.
Figure 2 shows the results of our generator after sufficient training iterations. We see that while its output looks like facades, the output is clearly faulty and can be easily characterized by a human as being fake.
4. Conditional GAN 1
After realizing that a regular GAN has problems learning the data distribution we used a conditional GAN (CGAN) (mirza_conditional_2014). The rationale behind this was that if we tell the GAN which city a picture is from, it can focus on learning facades and doesn’t get confused by different cities’ styles. The difference between a GAN and a CGAN is that the latter receives label information both in the generator and discriminator. As such, the discriminator has more information to distinguish between real and fake pictures, and the generator has an incentive to develop features specifically for each label.
We implemented it ourselves based on the code of the regular GAN. The generator of the CGAN gets a one-hot-encoded vector of the cities alongside the multivariate Gaussian distribution and can thus consider the additional information about the city that the discriminator expects in the image generation process. For instance, when the generator is supposed to output an image of Amsterdam, it gets the vectorattached to its input. Thus the input to the generator network is a vector of length 104 with our four cities (100+4).
The discriminator of the GAN proceeds as in the regular GAN and uses a convolutional neural network to process the input image. After the convolutional layers we concatenate their output with the one-hot-encoded vector of the city and pass it through a dense layer followed by a ReLU function. After that we have another dense layer followed by a sigmoid function that outputs whether the sample is real or fake.
Figure 3 shows that the results are even worse than for the regular GAN.
5. Conditional GAN with custom architecture
After achieving suboptimal results with this CGAN architecture we tried a different approach and modified the discriminator: instead of adding the one-hot-encoded vector with the city information after the convolution, we decided to add it as an input to each pixel in the beginning of the convolution. Thus, additional to the RGB values, each pixel also has the one-hot-encoded information of the city. For example, with four cities, each pixel has 7 color channels: 3 for the colors and 4 for the one-hot-encoded cities (Figure 4).
While this approach sounds wasteful because of the duplicate information that is fed into the convolutional layer, it achieved significantly better results. One possible explanation for this improvement is that by adding the city information to the input picture, we help the kernels of the convolution to work differently for each city.
Figure 5 shows that the results of our custom CGAN are significantly better than with our previous attempt and are also better than the normal GAN.
6. Influence of image size on the quality of generated images
We also tested the effect of image size on the quality of the generated images. For this, we used images with a length and width of 64, 128 and 256 pixels. Our general conclusion is that smaller image sizes improve quality: with 64 pixels Figure 6, the generated facades looked good and had no or only very minor flaws. With 128 we got a larger number of artifacts and faulty images but we deemed quality to be still acceptable. With 256, we couldn’t get satisfying results any longer: the generator never converged and changed its output over time when fed the same random input. Furthermore, the generator did not generalize over different cities and, when inputting an average of all cities using the input vector —which gives equal importance to all cities—the obtained image was only noise. Thus, the generator only learned to output facades for each city but did not learn which features all cities have in common. During the course of our further experiments we only used images of resolution .
7. Using a larger, uncurated dataset
Finally, we wanted to investigate whether a larger dataset could significantly improve the quality of generated images. For this we used a dataset of Stanford University (pytorch_deep_2019) and extracted 500,000 images for the cities of Amsterdam, Washington DC, Florence, Las Vegas and Manhattan. These images are not curated, thus it is not guaranteed that they actually show facades.
We noticed that Washington DC and Las Vegas are especially problematic for our neural network. In Washington DC, the majority of images depict trees, and in Las Vegas there are few multi-story buildings.
Thus we also tried training the CGAN with only these three cities, hoping this would increase the quality of learned images.
Generally speaking we think that the quality of the generated images is the same as the one of our own dataset Figure 7. However, while with our dataset we only learn facades, with the Stanford dataset we also learn entire streets with building on the left and right side of the street.
8. Traveling Between Cities
To understand if our network learns characteristics of each city, we can compare the output of the generator for the same vector with different labels. We show a 5-step interpolation for 3 distinct random vectors in Figure8. We show results interpolating from Amsterdam to each one of Florence, Washington DC and Manhattan. Additionally, we show also results of interpolating from “Europe” (half Amsterdam/half Florence) to the “US” (a third of each American city). These results are from the CGAN with the uncurated dataset described in the last section.
The first thing we notice is that the color palette of the cities is quite different. For example, Amsterdam is quite redish, while Florence has tones closer to beige (a). Another obvious characteristic is that Washington DC has a lot of trees (b). In all the examples with it, the amount of trees greatly increases when compared to Amsterdam.
There are also some more nuanced changes that characterize the stylistic architectural differences between cities. A good example of this is show between Amsterdam and Florence: distinct roofs, with breaks between them can be seen. This is most noticeable in the second row of a. Between Amsterdam and Manhattan there are also two strong distinctions: Manhattan has taller buildings, with more smaller windows. In the second row of Figure (c)c, the buildings grow as the label approaches Manhattan, and in the third row the windows split into smaller windows.
We also show the transition between Europe and the US. The US images look quite unrealistic, as they are an average between 3 cities with quite different styles. However, one can identify an increase of trees (Washington DC has really a lot of trees) and different angles of photography (buildings are farther from the camera in Washington DC and Las Vegas). Both of these differences are evident in the second row of d. Another good example of the Europe/US relation is in Figure 9, which shows an European style church being transformed into an American style skyscraper.
Another thing we can try is to experiment with subtracting stylistic features. We can for example subtract Manhattan’s label from Amsterdam’s, and intuitively that should remove typical Manhattan features from the Amsterdam image. Indeed, we do just that in Figure 10, and observe that the modern looking building in the background in Amsterdam completely disappears, after subtracting Manhattan. Also, if you subtract Manhattan, more trees appear. Apparently, in Manhattan suppresses the output of trees because there are not many there.
In this paper we sought to generate images of facades containing the stylistic features of specific cities. We experimented with multiple GAN architectures, and propose a custom architecture. Our custom architecture produced more realistic images than the alternative architectures, and was capable of generating architectural features specific to each city. By using a continuous label space, we were also able to transition one facade between cities, or do averages over them.
The results suggest that our architecture is also better suited for other tasks than the existing architectures. Investigating its effects on other tasks is a feasible future research path.
We have made all our code and used data available, so that the community can both reproduce and improve on our results.
The Titan Xp used for this research was donated by the NVIDIA Corporation.