1 Introduction
Temporal and spatial patterns of human interactions shape our cities, making them unique, but at the same time create universal processes that make urban structures comparable to each other. A longstanding effort of urban studies focuses on the creation of quantitative models of the spatial forms of cities that would capture their essential characteristics and enable data-driven comparisons. There have been several attempts at studying urban forms using quantitative methods, typically based on complexity theory or network science arcaute2016cities; barthelemy2008modeling; murcio2015multifrac; buhl2006topological; cardillo2006structural; masucci2009random; strano2013urban. These approaches create an abstract representation of an urban form to derive its key quantitative characteristics. Although theoretically robust, the abstractions are often too simplistic to capture the full breadth and complexity of existing urban structures.
With the increasing availability of urban street network data and the advancements in deep learning methods, we are presented with an unprecedented opportunity to push the frontiers of urban modelling towards more data-driven and accurate urban models. In this study, we present our initial work on applying deep generative models to urban street network data to create spatially explicit models of urban networks. We base our work on Variational Autoencoders (VAEs) trained on images of street networks. VAEs are deep generative models that have recently gained popularity due to their ability to generate realistic images. VAEs have two fundamental qualities that make them particularly suitable for urban modelling. Firstly, they can condense high-dimensional images of urban street networks into a low-dimensional representation, which enables quantitative comparisons between urban forms without any prior assumptions. Secondly, VAEs can generate new realistic urban forms that capture the diversity of existing cities.
In the following sections, we show our experiments based on urban street networks from OpenStreetMap (OSM). The results indicate that a VAE trained on the OSM data is capable of capturing critical high-level urban metrics using low-dimensional vectors. The model can also generate new urban forms whose structure matches that of the cities captured in the OSM dataset. All code and experiments for this study are available at https://github.com/kirakowalska/vaeurbannetwork.
2 Methodology and dataset
2.1 Variational Autoencoder
Variational Autoencoders (VAEs) have emerged as one of the most popular deep learning techniques for unsupervised learning of complicated data distributions. VAEs are particularly appealing because they compress data into a lower-dimensional representation which can be used for quantitative comparisons and new data generation. VAEs are built on top of standard function approximators (neural networks) efficiently trained with stochastic gradient descent kingma2013auto. VAEs have already been used to generate many kinds of complex data, including handwritten digits, faces and house numbers, and to predict the future from static images. In this work, we apply VAEs to street network images to learn low-dimensional representations of street networks. We use the representations to make quantitative comparisons between urban forms without making any prior assumptions and to generate new realistic urban forms.

A variational autoencoder consists of an encoder, a decoder, and a loss function. The encoder is a neural network. Its input is a datapoint $x$, its output is a hidden representation $z$, and it has weights and biases $\phi$. The goal of the encoder is to 'encode' the data into a latent (hidden) representation space, which has far fewer dimensions than the data. This is typically referred to as a 'bottleneck' because the encoder must learn an efficient compression of the data into this lower-dimensional space. The encoder is denoted by $q_\phi(z \mid x)$. The decoder is another neural network. Its input is the representation $z$, its output is a datapoint $x$, and it has weights and biases $\theta$. The decoder is denoted by $p_\theta(x \mid z)$. The decoder 'decodes' the low-dimensional latent representation $z$ into the datapoint $x$. Information is lost in the process because the decoder translates from a smaller to a larger dimensionality. How much information is lost? The information loss is measured using the reconstruction log-likelihood $\log p_\theta(x \mid z)$. The measure indicates how effectively the decoder has learned to reconstruct an input image $x$ given its latent representation $z$.
The loss function of the variational autoencoder is the sum of the reconstruction loss, given by the negative log-likelihood, and a regularizer. The total loss is the sum $\sum_{i=1}^{N} l_i$ of the losses for the $N$ datapoints, where the loss function $l_i$ for datapoint $x_i$ is:

$$l_i(\theta, \phi) = -\mathbb{E}_{z \sim q_\phi(z \mid x_i)}\left[\log p_\theta(x_i \mid z)\right] + \mathrm{KL}\left(q_\phi(z \mid x_i) \,\|\, p(z)\right) \quad (1)$$
The first term is the reconstruction loss, or expected negative log-likelihood, of the $i$-th datapoint. This term encourages the decoder to learn to reconstruct the data. Poor reconstruction of the data from its latent representation will incur a large cost in this loss term. The second term is a regularizer that we introduce to ensure that the distribution of the latent values approaches the prior distribution $p(z)$, specified as a Normal distribution with mean zero and variance one. The regularizer is the Kullback-Leibler divergence between the encoder's distribution $q_\phi(z \mid x)$ and $p(z)$; it measures how close $q_\phi(z \mid x)$ is to $p(z)$. The regularizer ensures that the representations $z$ of each datapoint are sufficiently diverse and distributed approximately according to a normal distribution, from which we can easily sample.

The variational autoencoder is trained using gradient descent to optimize the loss with respect to the parameters of the encoder and decoder, $\phi$ and $\theta$.
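In code, the loss in (1) can be written down directly. The sketch below is a minimal NumPy illustration, assuming a Bernoulli decoder (so the reconstruction term becomes a binary cross-entropy) and a diagonal-Gaussian encoder, for which the KL divergence against the standard Normal prior has a closed form:

```python
import numpy as np

def vae_loss(x, x_recon, mu, logvar, eps=1e-7):
    """Per-datapoint VAE loss: reconstruction term plus KL regularizer."""
    # Reconstruction term: negative log-likelihood of the binary image x
    # under the decoder's Bernoulli output x_recon
    recon = -np.sum(x * np.log(x_recon + eps) + (1 - x) * np.log(1 - x_recon + eps))
    # Regularizer: KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    return recon + kl
```

When the encoder outputs the prior itself (`mu = 0`, `logvar = 0`), the KL term vanishes and only the reconstruction cost remains.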
In our work, we selected Convolutional Neural Networks (CNNs) fukushima1980neocognitron; lecun1990handwritten as the encoder and decoder architectures. CNNs are deep learning architectures that are particularly well-suited to image data lecun1995convolutional; krizhevsky2012imagenet, as they consider the two-dimensional structure of images and scale well to high-dimensional images. We tested several CNN architectures and finally chose the network architecture in Figure 2, with the encoder and decoder each consisting of four convolutional blocks, each block containing a convolutional layer and a rectified linear unit (ReLU) layer (which introduces non-linearity to the network). The architecture takes as input an image of size 64 × 64 pixels, convolves the image through the encoder network and condenses it to a 32-dimensional latent representation. The decoder then reconstructs the original image from the condensed latent representation. We implemented the variational autoencoder using the PyTorch library for Python.
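A PyTorch sketch of such an encoder/decoder pair is shown below. Only the overall shape (four convolutional blocks with ReLUs, 64 × 64 input, 32-dimensional latent space) follows the description above; the layer widths and kernel sizes are our assumptions for illustration, as the exact architecture is given in Figure 2.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Four conv blocks that halve the spatial size: 64 -> 32 -> 16 -> 8 -> 4."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Two heads: mean and log-variance of the latent Gaussian
        self.fc_mu = nn.Linear(256 * 4 * 4, latent_dim)
        self.fc_logvar = nn.Linear(256 * 4 * 4, latent_dim)

    def forward(self, x):
        h = self.conv(x).flatten(1)
        return self.fc_mu(h), self.fc_logvar(h)

class Decoder(nn.Module):
    """Mirror of the encoder: upsample 4 -> 8 -> 16 -> 32 -> 64."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 256, 4, 4)
        return self.deconv(h)
```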
2.2 Street Network Data
The street networks used for model training and testing were obtained from OpenStreetMap haklay2008openstreetmap by ranking world cities by 2017 population and then selecting the ones with more than 500,000 inhabitants, for a total of 1,059 cities (we compiled the list of cities from the UN data website http://data.un.org, accessed December 2018). We saved the street networks as images and, as the variational autoencoder requires images with a fixed spatial scale, we extracted a 3 × 3 km sample from the centre of each city image and resized it to a 64 × 64 pixel binary image. The final dataset contained 1,059 binary images of 64 × 64 pixels, which we split into 80% training and 20% testing sets. During model training, we augmented the training dataset by randomly cropping the images and flipping them horizontally. Figure 3 shows images for randomly selected cities.
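The exact resizing procedure is not detailed here, so the following NumPy sketch shows one plausible approach: block max-pooling, which keeps one-pixel-wide roads visible after downsampling to 64 × 64 (a plain average or nearest-neighbour resize can erase them). Treat it as an illustration, not the study's preprocessing code.

```python
import numpy as np

def to_64px_binary(img):
    """Downsample a square binary image whose side is a multiple of 64."""
    n = img.shape[0]
    assert img.shape == (n, n) and n % 64 == 0
    block = n // 64
    # View the image as a 64x64 grid of block x block tiles and take the
    # max of each tile, so any road pixel inside a tile survives
    tiles = img.reshape(64, block, 64, block)
    return tiles.max(axis=(1, 3)).astype(np.uint8)
```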
3 Results
3.1 Reconstruction quality
The variational autoencoder was trained to minimise the loss function defined in (1). The training is equivalent to minimising the image reconstruction loss, subject to a regularizer. We can inspect the training quality by visually comparing reconstructed images to their original counterparts. Figure 4 shows several examples of reconstructed images of urban street networks. As observed in the examples, the trained autoencoder performs well at reconstructing the overall shape of road networks and their main roads. The quality of the reconstruction drops for very dense road networks, for which only the overall network shape is captured by the autoencoder (see the leftmost image in Figure 4). This observation suggests that variational autoencoders are better suited to reconstructing images with wide patches of pixels with similar properties rather than narrow stretches such as roads.
3.2 Urban networks comparison
The trained autoencoder learnt a mapping from the space of street network images (64 × 64 pixels, or 4,096 dimensions) to a lower-dimensional latent space (32 dimensions). The latent representation stores the information required to reconstruct the original image of the street network, so it is effectively a condensed representation of the street network that preserves its connectivity and spatial information. In the absence of well-defined similarity metrics for urban networks, this paper uses the condensed representations as vectors of street network features. Hereafter, we call these vectors urban network vectors. Urban network vectors can be used to measure the similarity between different street network forms and to perform further similarity analysis, such as clustering.
Similarity analysis
Firstly, we demonstrated the use of urban network vectors for measuring the similarity between urban street forms. We measured the similarity between pairs of vectors as the Euclidean distance. Given two urban network vectors $u$ and $v$, where $n$ is the size of the latent space (here $n = 32$), the Euclidean distance between $u$ and $v$ is defined as:

$$d(u, v) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2} \quad (2)$$
Figure 5 shows randomly chosen street networks (top row) and their most similar networks based on the Euclidean distance between their urban network vectors. As shown in the figure, the proposed methodology enables finding street networks with matching properties, such as network density, spatial structure and orientation, without explicitly including any of these properties in the similarity computation.
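A retrieval step of this kind is straightforward to sketch. In the hypothetical snippet below, `vectors` stands for the matrix of urban network vectors produced by the encoder (one row per city); `most_similar` is an illustrative helper, not part of the published code.

```python
import numpy as np

def most_similar(vectors, query_idx, k=5):
    """Return indices of the k vectors closest (Euclidean) to vectors[query_idx]."""
    # Distance from the query vector to every vector in the collection, Eq. (2)
    d = np.linalg.norm(vectors - vectors[query_idx], axis=1)
    d[query_idx] = np.inf  # exclude the query itself from its own neighbours
    return np.argsort(d)[:k]
```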
Clustering
Secondly, we used the urban network vectors to detect clusters of similar urban street forms. We used the k-means clustering algorithm witten2016data, a popular clustering approach that assigns data points to clusters based on their distances to cluster centroids. The algorithm requires specifying the number of clusters $k$ a priori. We identified $k = 3$ as the optimal number of clusters for the street image data using the elbow method dangeti2017statistics. As shown in Figure 6a, the obtained clusters seem to separate street networks based on their density only, failing to reflect more subtle network differences, such as road connectivity or road shapes. When we increased the number of clusters (Figure 6b), we could differentiate road networks based on more subtle characteristics, such as the disconnectedness of roads in the first cluster (top left in Figure 6b) or large gaps in road provision in the second cluster (top centre in Figure 6b). We visualised both cluster assignments in Figure 6 (right) by projecting the 32-dimensional urban network vectors onto a two-dimensional grid using the t-SNE algorithm maaten2008visualizing for dimensionality reduction. The visualisation shows that street networks naturally cluster into the three groups detected by the k-means algorithm. The three clusters are further mapped in Figure 7 to investigate spatial patterns in urban form variation.
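The clustering step can be illustrated with a compact NumPy implementation of k-means; in practice an off-the-shelf implementation (e.g. scikit-learn's) would likely be used, so treat this as a sketch of the procedure rather than the study's code.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Cluster rows of X into k groups with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    # Initialise centroids at k distinct data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        # Update step: move each centroid to the mean of its assigned vectors
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```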
3.3 Urban networks generation
In Section 3.2, we used the autoencoder to compress real street images to low-dimensional vectors, which we then used to make quantitative comparisons. This employed one strength of variational autoencoders: the ability to encode high-dimensional observations as meaningful low-dimensional representations. The second strength pertains to the ability to generate realistic urban street forms that match the complexity of urban forms across the globe. This ability could potentially advance the current state of the art in simulations of urban forms and of socio-economic processes taking place on urban networks.
To generate a synthetic urban network, we first sample an embedding value from the prior distribution, specified as a standard Gaussian (see Section 2.1), and then pass the value through the decoder network to obtain the corresponding image. Images corresponding to several embedding samples are shown in Figure 8. As shown in the figure, the generated images lack the detail of the real street images in Figure 3. Although the samples follow the general structure of road networks, with major roads and areas of mixed-density minor roads, the decoder fails to reconstruct the details of dense road segments and instead renders them blurred. The problem must be attributed to the small number of images used in the study. Although the proposed model is flexible enough to model urban street networks, as confirmed by the high-quality reconstructions of real images in Figure 4, it does not see enough images to learn to interpolate between them and sample new street network forms in sufficient detail.
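The generation procedure amounts to two lines once a trained decoder is available. In the sketch below, `decoder` is a trivial linear stand-in (so the snippet runs end-to-end); in the actual model it would be the trained convolutional decoder.

```python
import torch
import torch.nn as nn

latent_dim = 32
# Stand-in decoder: one linear layer plus a sigmoid, mapping a 32-d
# embedding to pixel intensities in [0, 1]; NOT the trained network
decoder = nn.Sequential(nn.Linear(latent_dim, 64 * 64), nn.Sigmoid())

z = torch.randn(8, latent_dim)          # sample embeddings from the N(0, I) prior
images = decoder(z).view(8, 1, 64, 64)  # decode into 64x64 street-network images
```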
4 Discussion and conclusions
This study is an early exploration of how modern generative machine learning models such as variational autoencoders could augment our ability to model urban forms. With the ability to extract key urban features from high-dimensional urban imagery, variational autoencoders open new avenues for integrating high-dimensional data streams into urban modelling. The study considered images of street networks, but the proposed methodology could equally be applied to other image data, such as urban satellite imagery.
Variational autoencoders were selected among deep generative models moosavi2017urban; albert2018modeling due to two capabilities: firstly, the ability to condense images to low-dimensional representations; secondly, the ability to generate new, previously unseen images that match the complexity of observed images. The first capability enabled us to extract key urban metrics from street network images; the second gave us the power to generate realistic images of previously unseen urban networks.
Our results, based on 1,059 city images across the globe, showed that VAEs successfully condensed urban images into low-dimensional urban network vectors. This enabled quantitative similarity analysis between urban forms, such as clustering. What is more, VAEs managed to generate new urban forms with complexity matching that of the observed data. Unfortunately, the resolution of the generated images was low, which we attributed to the small size of the dataset. Future work will repeat model training on a much larger corpus of images to improve the generative quality.
Despite the promising results, the study raises essential questions for future work. The first question pertains to the black-box nature of deep learning models, which lack comprehensive human interpretability. This limitation is already receiving much attention in the deep learning literature lime; shrikumar2017learning; lundberg2017unified. In this study, the limitation manifests itself in our lack of understanding of how latent space representations of urban networks relate to established network metrics newman2010networks. A related question refers to the ability to evaluate the quality of model outputs, i.e. latent representations and synthetic images. Again, quality assessment of deep generative models is a hot topic in the broader deep learning research community (see for example wu2016quantitative). Future work could address the problem from the perspective of urban network science.
5 Declarations
Availability of data and materials
All data and program source code described in this article are available to any interested parties. The source code and experiments are available on GitHub at the following URL: https://github.com/kirakowalska/vaeurbannetwork. The raw data and datasets generated during this study are available upon request.
Competing interests
The authors declare that they have no competing interests.
Funding
There is no specific funding received for the study.
Authors’ contributions
KK designed and implemented the methodology, executed the computer runs, and wrote the initial version of the article. RM prepared street network data and extensively revised the article. Both authors read and approved the final manuscript.
Acknowledgements
The authors would like to thank Szymon Zareba and Adam Gonczarek (Alphamoon Ltd) for advice on deep generative models during the course of the project.
Authors’ information
KK is a lecturer in geospatial machine learning at the Bartlett’s Centre for Advanced Spatial Analysis, University College London, UK and a machine learning researcher at Alphamoon, PL. She develops machine learning algorithms for urban modelling and sensor data mining. Her research interests include geospatial data mining, sensor data fusion and machine learning for sensor networks.
RM is a senior research fellow at the Bartlett’s Centre for Advanced Spatial Analysis, University College London, UK. His academic interests include urban complex networks, information transfer in social systems, spatial interaction models and pedestrian flows. One of his main research topics is the application of multifractal measures to different urban aspects, such as street networks and social inequality.
References
 (1) Albert, A., Strano, E., Kaur, J., González, M.: Modeling urbanization patterns with generative adversarial networks. In: IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 2095–2098. IEEE (2018)
 (2) Arcaute, E., Molinero, C., Hatna, E., Murcio, R., Vargas-Ruiz, C., Masucci, A.P., Batty, M.: Cities and regions in Britain through hierarchical percolation. Royal Society Open Science 3(4), 150691 (2016). DOI https://doi.org/10.1098/rsos.150691
 (3) Barthélemy, M., Flammini, A.: Modeling urban street patterns. Physical review letters 100(13), 138702 (2008)
 (4) Buhl, J., Gautrais, J., Reeves, N., Solé, R., Valverde, S., Kuntz, P., Theraulaz, G.: Topological patterns in street networks of self-organized urban settlements. The European Physical Journal B - Condensed Matter and Complex Systems 49(4), 513–522 (2006)
 (5) Cardillo, A., Scellato, S., Latora, V., Porta, S.: Structural properties of planar graphs of urban street patterns. Physical Review E 73(6), 066107 (2006)
 (6) Dangeti, P.: Statistics for machine learning. Packt Publishing Ltd (2017)

 (7) Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36(4), 193–202 (1980)
 (8) Haklay, M., Weber, P.: OpenStreetMap: User-generated street maps. IEEE Pervasive Computing 7(4), 12–18 (2008)
 (9) Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)

 (10) Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
 (11) LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks 3361(10), 1995 (1995)
 (12) LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., Jackel, L.D.: Handwritten digit recognition with a backpropagation network. In: Advances in neural information processing systems, pp. 396–404 (1990)
 (13) Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems, pp. 4765–4774 (2017)
 (14) Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605 (2008)
 (15) Masucci, A.P., Smith, D., Crooks, A., Batty, M.: Random planar graphs and the London street network. The European Physical Journal B 71(2), 259–271 (2009)
 (16) Moosavi, V.: Urban morphology meets deep learning: Exploring urban forms in one million cities, towns and villages across the planet. arXiv preprint arXiv:1709.02939 (2017)
 (17) Newman, M.: Networks: an introduction. Oxford university press (2010)
 (18) Murcio, R., Masucci, A.P., Arcaute, E., Batty, M.: Multifractal to monofractal evolution of the London street network. Physical Review E 92(6), 062130 (2015). DOI https://doi.org/10.1103/PhysRevE.92.062130

 (19) Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?": Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp. 1135–1144 (2016)
 (20) Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3145–3153. JMLR.org (2017)
 (21) Strano, E., Viana, M., da Fontoura Costa, L., Cardillo, A., Porta, S., Latora, V.: Urban street networks, a comparative analysis of ten European cities. Environment and Planning B: Planning and Design 40(6), 1071–1086 (2013)
 (22) Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann (2016)
 (23) Wu, Y., Burda, Y., Salakhutdinov, R., Grosse, R.: On the quantitative analysis of decoderbased generative models. arXiv preprint arXiv:1611.04273 (2016)