Modelling urban networks using Variational Autoencoders

A long-standing question for urban and regional planners pertains to the ability to describe urban patterns quantitatively. Cities' transport infrastructure, particularly street networks, provides an invaluable source of information about the urban patterns generated by people's movements and their interactions. With the increasing availability of street network datasets and the advancements in deep learning methods, we are presented with an unprecedented opportunity to push the frontiers of urban modelling towards more data-driven and accurate models of urban forms. In this study, we present our initial work on applying deep generative models to urban street network data to create spatially explicit urban models. We based our work on Variational Autoencoders (VAEs), which are deep generative models that have recently gained popularity due to their ability to generate realistic images. Initial results show that VAEs are capable of capturing key high-level urban network metrics using low-dimensional vectors and generating new urban forms of complexity matching the cities captured in the street network data.


Introduction
Temporal and spatial patterns of human interactions shape our cities, making them unique, but, at the same time, create universal processes that make urban structures comparable to each other. A long-standing effort of urban studies focuses on the creation of quantitative models of the spatial forms of cities that would capture their essential characteristics and enable data-driven comparisons. There have been several attempts at studying urban forms using quantitative methods, typically based on complexity theory or network science [2,3,18,4,5,15,21]. These approaches create an abstract representation of an urban form to derive its key quantitative characteristics. Although theoretically robust, the abstractions might often be too simplistic to capture the full breadth and complexity of existing urban structures.
With the increasing availability of urban street network data and the advancements in deep learning methods, we are presented with an unprecedented opportunity to push the frontiers of urban modelling towards more data-driven and accurate urban models. In this study, we present our initial work on applying deep generative models to urban street network data to create spatially explicit models of urban networks. We based our work on Variational Autoencoders (VAEs) trained on images of street networks. VAEs are deep generative models that have recently gained popularity due to their ability to generate realistic images. VAEs have two fundamental qualities that make them particularly suitable for urban modelling. Firstly, they can condense high-dimensional images of urban street networks to a low-dimensional representation, which enables quantitative comparisons between urban forms without any prior assumptions. Secondly, VAEs can generate new realistic urban forms that capture the diversity of existing cities.
In the following sections, we show our experiments based on urban street networks from OpenStreetMap (OSM). The results indicate that a VAE trained on the OSM data is capable of capturing critical high-level urban metrics using low-dimensional vectors. The model can also generate new urban forms of structure matching the cities captured in the OSM dataset. All code and experiments for this study are available at https://github.com/kirakowalska/vae-urban-network.

Variational Autoencoder
Variational Autoencoders (VAEs) have emerged as one of the most popular deep learning techniques for unsupervised learning of complicated data distributions. VAEs are particularly appealing because they compress data into a lower-dimensional representation which can be used for quantitative comparisons and new data generation. VAEs are built on top of standard function approximators (neural networks) efficiently trained with stochastic gradient descent [9]. VAEs have already been used to generate many kinds of complex data, including handwritten digits, faces and house numbers, and to predict the future from static images. In this work, we apply VAEs to street network images to learn low-dimensional representations of street networks. We use the representations to make quantitative comparisons between urban forms without making any prior assumptions and to generate new realistic urban forms.
Fig. 1 Variational Autoencoder takes as input an image of the street network (left), condenses the image to a lower-dimensional encoding (middle) and finally reconstructs the image given the encoding (right).
A variational autoencoder consists of an encoder, a decoder, and a loss function. The encoder is a neural network. Its input is a datapoint x, its output is a hidden representation z, and it has weights and biases θ. The goal of the encoder is to 'encode' the data into a latent (hidden) representation space z, which has far fewer dimensions than the data. This is typically referred to as a 'bottleneck' because the encoder must learn an efficient compression of the data into this lower-dimensional space. The encoder is denoted by $q_\theta(z \mid x)$.
The decoder is another neural network. Its input is the representation z, its output is a datapoint x, and it has weights and biases φ. The decoder is denoted by $p_\phi(x \mid z)$. The decoder 'decodes' the low-dimensional latent representation z into the datapoint x. Information is lost in the process because the decoder translates from a smaller to a larger dimensionality. How much information is lost? The information loss is measured using the reconstruction log-likelihood $\log p_\phi(x \mid z)$. The measure indicates how effectively the decoder has learned to reconstruct an input image x given its latent representation z.
The loss function of the variational autoencoder is the sum of the reconstruction loss, given by the negative log-likelihood, and a regularizer. The total loss is the sum of losses $\sum_{i=1}^{N} l_i$ for N datapoints, where the loss function $l_i$ for datapoint $x_i$ is:
$$l_i(\theta, \phi) = -\mathbb{E}_{z \sim q_\theta(z \mid x_i)}\left[\log p_\phi(x_i \mid z)\right] + \mathrm{KL}\left(q_\theta(z \mid x_i) \,\|\, p(z)\right) \tag{1}$$
The first term is the reconstruction loss or expected negative log-likelihood of the i-th datapoint. This term encourages the decoder to learn to reconstruct the data. Poor reconstruction of the data x from its latent representation z will incur a large cost in this loss term. The second term is a regularizer that we introduce to ensure that the distribution of the latent values z approaches the prior distribution p(z), specified as a Normal distribution with mean zero and variance one. The regularizer is the Kullback-Leibler divergence between the encoder's distribution $q_\theta(z \mid x)$ and p(z). It measures how close q is to p. The regularizer ensures that the representations z of each datapoint are sufficiently diverse and distributed approximately according to a normal distribution, from which we can easily sample.
The variational autoencoder is trained using gradient descent to optimize the loss with respect to the parameters of the encoder and decoder θ and φ.
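As an illustration of the two loss terms, the following minimal NumPy sketch evaluates the loss for a single datapoint. It rests on assumptions that the text does not state: the decoder outputs per-pixel Bernoulli probabilities (a natural choice for binary images), the encoder outputs the mean and log-variance of a diagonal Gaussian, and the expectation in the reconstruction term is approximated with one decoded sample. The function name `vae_loss` and its arguments are hypothetical.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var, eps=1e-7):
    """Loss for one datapoint: reconstruction term plus KL regulariser.

    x        : flattened binary image with values in {0, 1}
    x_recon  : decoder output, assumed to be Bernoulli probabilities per pixel
    mu       : mean of the encoder distribution q(z|x), one value per latent dim
    log_var  : log-variance of q(z|x), one value per latent dim
    """
    x_recon = np.clip(x_recon, eps, 1.0 - eps)  # guard against log(0)
    # Negative Bernoulli log-likelihood, summed over pixels.
    recon = -np.sum(x * np.log(x_recon) + (1.0 - x) * np.log(1.0 - x_recon))
    # Closed-form KL divergence between N(mu, diag(exp(log_var))) and N(0, I).
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    return recon + kl, recon, kl

# Toy inputs (random, not real street data): a 64 x 64 binary image and a
# 32-dimensional latent space, matching the sizes used in this paper.
rng = np.random.default_rng(0)
x = (rng.random(64 * 64) < 0.2).astype(float)
x_recon = np.full(64 * 64, 0.2)
mu, log_var = np.zeros(32), np.zeros(32)

# With mu = 0 and log_var = 0 the encoder distribution equals the prior,
# so the KL term vanishes and only the reconstruction term remains.
total, recon, kl = vae_loss(x, x_recon, mu, log_var)
print(f"total={total:.2f} reconstruction={recon:.2f} kl={kl:.2f}")
```

The KL term grows as the encoder's mean moves away from zero or its variance away from one, which is how the regularizer pulls the latent codes towards the prior.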
In our work, we selected Convolutional Neural Networks (CNNs) [7,12] as the encoder and decoder architectures. CNNs are deep learning architectures that are particularly well-suited to image data [11,10] as they consider the two-dimensional structure of images and scale well to high-dimensional images. We tested several CNN architectures and finally chose the architecture shown in Figure 2, in which the encoder and the decoder each consist of four convolutional blocks, each with a convolutional layer and a rectified linear unit (ReLU) layer (which introduces non-linearity to the network). The architecture takes as input an image of size 64 x 64 pixels, convolves the image through the encoder network and then condenses it to a 32-dimensional latent representation. The decoder then reconstructs the original image from the condensed latent representation. We implemented the variational autoencoder using the PyTorch library for Python.
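A convolutional VAE of this shape could be sketched in PyTorch as follows. Only the input size (64 x 64), the latent size (32) and the four convolution plus ReLU blocks come from the description above. The channel widths, kernel size, stride, the linear layers around the bottleneck, and the final sigmoid (used here in place of a ReLU so that outputs lie in [0, 1]) are assumptions made for illustration, and the class name `ConvVAE` is hypothetical. It is not the exact architecture of Figure 2.

```python
import torch
from torch import nn

class ConvVAE(nn.Module):
    """Illustrative convolutional VAE: 1 x 64 x 64 image <-> 32-d latent vector."""

    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: each stride-2 convolution halves the size, 64 -> 32 -> 16 -> 8 -> 4.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(256 * 4 * 4, latent_dim)
        self.fc_log_var = nn.Linear(256 * 4 * 4, latent_dim)
        # Decoder: mirror of the encoder, 4 -> 8 -> 16 -> 32 -> 64.
        self.fc_decode = nn.Linear(latent_dim, 256 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            # Assumption: sigmoid output so each pixel is a probability.
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_log_var(h)

    def decode(self, z):
        h = self.fc_decode(z).view(-1, 256, 4, 4)
        return self.decoder(h)

    def forward(self, x):
        mu, log_var = self.encode(x)
        # Reparameterisation: z = mu + sigma * noise keeps sampling differentiable.
        std = torch.exp(0.5 * log_var)
        z = mu + std * torch.randn_like(std)
        return self.decode(z), mu, log_var

model = ConvVAE()
batch = torch.rand(8, 1, 64, 64)  # random stand-in for a batch of street images
recon, mu, log_var = model(batch)
print(recon.shape, mu.shape)
```

The mean vector returned by `encode` is the natural candidate for the low-dimensional representation of an image, since it does not depend on the sampling noise.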

Street Network Data
The street networks used for model training and testing were obtained from OpenStreetMap [8] by ranking world cities by 2017 population and then selecting the ones with more than 500,000 inhabitants, for a total of 1,059 cities. We saved the street networks as images and, as the variational autoencoder requires images to have a fixed spatial scale, we extracted a 3 x 3 km sample from the centre of each city image and resized it to a binary image of 64 x 64 pixels. The final dataset contained 1,059 binary images of 64 x 64 pixels, which we split into 80% training and 20% testing datasets. During model training, we augmented the training dataset by randomly cropping and flipping the images horizontally. Figure 3 shows images for randomly selected cities.
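The preprocessing steps can be illustrated with a short NumPy sketch. It assumes each city is already rasterised as a square array in which non-zero pixels mark streets and in which 512 pixels happen to span the 3 x 3 km window; both are hypothetical choices, as are all function names. Downsampling by block maximum is one possible way to obtain a binary 64 x 64 image and is not necessarily the resizing method used in the study. Only horizontal flipping is shown for augmentation; random cropping is omitted.

```python
import numpy as np

def centre_crop(image, size):
    """Cut a size x size window from the centre of a 2-D raster."""
    h, w = image.shape
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

def to_binary_64(image, out=64):
    """Downsample a square raster to out x out cells.

    A cell is set to 1 if any street pixel falls inside it (block maximum).
    """
    image = np.asarray(image, dtype=np.uint8)
    factor = image.shape[0] // out
    trimmed = image[:out * factor, :out * factor]
    blocks = trimmed.reshape(out, factor, out, factor)
    return (blocks.max(axis=(1, 3)) > 0).astype(np.uint8)

def random_flip(image, rng):
    """Flip the image horizontally with probability 0.5."""
    return image[:, ::-1] if rng.random() < 0.5 else image

def train_test_split(images, train_fraction=0.8, seed=0):
    """Shuffle the images and split them into training and testing sets."""
    order = np.random.default_rng(seed).permutation(len(images))
    n_train = int(train_fraction * len(images))
    return images[order[:n_train]], images[order[n_train:]]

# Random stand-ins for ten rasterised cities (not real OpenStreetMap data).
rng = np.random.default_rng(0)
rasters = rng.random((10, 600, 600)) < 0.01
dataset = np.stack([to_binary_64(centre_crop(r, 512)) for r in rasters])
train, test = train_test_split(dataset)
augmented = random_flip(train[0], rng)
print(dataset.shape, len(train), len(test))
```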

Reconstruction quality
The variational autoencoder was trained to minimise the loss function defined in (1). The training is equivalent to minimising the image reconstruction loss, subject to a regularizer. We can inspect the training quality by visually comparing reconstructed images to their original counterparts. Figure 4 shows several examples of reconstructed images of urban street networks. As observed in the examples, the trained autoencoder performs well at reconstructing the overall shape of road networks and their main roads. The quality of the reconstruction drops for very dense road networks, for which the autoencoder captures only the overall network shape (see the leftmost image in Figure 4). The observation suggests that variational autoencoders are better suited for reconstructing images with wide patches of pixels with similar properties than images with narrow stretches such as roads.

Urban networks comparison
The trained autoencoder learnt a mapping from the space of street network images (64 x 64 or 4,096 dimensions) to a lower-dimensional latent space (32 dimensions). The latent representation stores all the information required to reconstruct the original image of the street network, so it is effectively a condensed representation of the street network that preserves all its connectivity and spatial information. In the absence of well-defined similarity metrics for urban networks, this paper uses the condensed representations as vectors of street network features. Hereafter, we call the vectors urban network vectors. Urban network vectors can be used to measure the similarity between different street network forms and to perform further similarity analysis, such as clustering.
Similarity analysis. Firstly, we demonstrated the use of urban network vectors for measuring similarity between urban street forms. We measured the similarity between pairs of vectors as the Euclidean distance. Given two urban network vectors $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$, where n = 32 is the size of the latent space z, the Euclidean distance between p and q is defined as:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{2}$$
Figure 5 shows randomly chosen street networks (top row) and their most similar networks based on the Euclidean distance between their urban network vectors. As shown in the figure, the proposed methodology enables finding street networks with matching properties, such as network density, spatial structure and orientation, without explicitly including any of the properties in the similarity computation.
Clustering. Secondly, we used the urban network vectors to detect clusters of similar urban street forms. We used the K-means clustering algorithm [22]. It is a popular clustering approach that assigns data points to K clusters based on distances to cluster centroids. The algorithm requires specifying the number of clusters K a priori. We identified K = 3 as the optimal number of clusters for the street image data using the elbow method [6]. As shown in Figure 6a, the obtained clusters seem to separate street networks based on their density only, failing to reflect more subtle network differences, such as road connectivity or road shapes. When we increased the number of clusters to K = 6 in Figure 6b, we could differentiate road networks based on more subtle network characteristics, such as disconnectedness of roads in the first cluster (top-left in Figure 6b) or large gaps in road provision in the second cluster (top-centre in Figure 6b). We visualised both cluster assignments in Figure 6 (right) by projecting the thirty-two-dimensional urban network vectors to a two-dimensional grid using the t-SNE algorithm [14] for dimensionality reduction. The visualisations show that street networks naturally cluster into three groups that were detected by the K-means algorithm. The three clusters are further mapped in Figure 7 to investigate spatial patterns in urban form variation.
Fig. 5 Street network images (top row) with most similar street networks (rows below) based on the Euclidean distance between their urban network vectors. The latent representations, obtained using the trained encoder, seem to capture well network properties such as density, orientation or road shape.
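The nearest-neighbour lookup behind this kind of comparison can be sketched in a few lines of NumPy. The vectors below are random stand-ins for urban network vectors rather than outputs of the trained encoder, and the function name `most_similar` is hypothetical.

```python
import numpy as np

def most_similar(vectors, query_index, k=3):
    """Return the indices and Euclidean distances of the k vectors closest to the query."""
    diffs = vectors - vectors[query_index]
    distances = np.sqrt(np.sum(diffs**2, axis=1))  # Euclidean distance to every vector
    order = np.argsort(distances)
    order = order[order != query_index][:k]  # drop the query itself
    return order, distances[order]

# Random 32-dimensional stand-ins for the urban network vectors of 100 cities.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 32))
vectors[7] = vectors[0] + 0.01  # plant a near-duplicate of city 0

neighbours, dists = most_similar(vectors, query_index=0)
print(neighbours, dists)
```

Because city 7 was constructed as a slightly shifted copy of city 0, it is returned as the closest match, which gives a quick sanity check of the lookup.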
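The clustering step and the elbow method can be illustrated with a compact NumPy implementation of the standard K-means iteration (Lloyd's algorithm). This is a sketch on synthetic 32-dimensional vectors, not the implementation used in the study; the function names and the synthetic blobs are assumptions. A library implementation with several random restarts would be more robust in practice.

```python
import numpy as np

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means: alternate assignment and centroid update until stable."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(updated, centroids)
        centroids = updated
        if converged:
            break
    labels = assign(points, centroids)
    # Inertia: sum of squared distances of points to their assigned centroid.
    inertia = float(np.sum((points - centroids[labels]) ** 2))
    return labels, centroids, inertia

# Synthetic stand-in for urban network vectors: three blobs in 32 dimensions.
rng = np.random.default_rng(0)
centres = rng.normal(scale=5.0, size=(3, 32))
vectors = np.concatenate([c + rng.normal(size=(50, 32)) for c in centres])

# Elbow method: compute the inertia for a range of K and look for the value
# of K beyond which the decrease flattens out.
inertias = {k: kmeans(vectors, k)[2] for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 1))
```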

Urban networks generation
In Section 3.2, we used the autoencoder to compress real street images to low-dimensional vectors, which we then used to make quantitative comparisons. This employed one strength of variational autoencoders: the ability to encode high-dimensional observations as meaningful low-dimensional representations. The second strength pertains to the ability to generate realistic urban street forms that match the complexity of urban forms across the globe. The ability could potentially advance the current state-of-the-art in simulations of urban forms and socio-economic processes taking place on urban networks.
To generate a synthetic urban network, we first sample an embedding value z from the prior distribution p(z), specified as a standard Gaussian (see Section 2.1), and then pass the value through the decoder network to obtain a corresponding image. Images corresponding to several embedding samples are shown in Figure 8. As shown in the figure, the generated images lack the detail of real street images in Figure 3. Although the samples follow the general structure of road networks with major roads and areas of mixed-density minor roads, the decoder fails to reconstruct details of dense road segments and instead renders them blurred. We attribute the problem to the small number of images used in the study. Although the proposed model is flexible enough to model urban street networks, which is confirmed by high-quality reconstructions of real images in Figure 4, it does not see enough images to learn to interpolate between them and to sample new forms of street networks in sufficient detail.
Fig. 7 Distribution of urban street forms across the globe. Each dot represents a city and is colour-coded according to cluster memberships in Figure 6a. Despite limited data size, spatial trends start to emerge, such as the concentration of high-density urban networks in California, USA (red cluster) and low-density urban networks in south-eastern Asia (black cluster).
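The sampling procedure described in this section (draw z from the standard Gaussian prior, then decode) can be sketched as follows. The decoder here is a hypothetical stand-in, a fixed random linear map followed by a sigmoid, used only so that the example runs on its own; it is not a trained model and its outputs do not resemble street networks. The function names and the 0.5 binarisation threshold are likewise assumptions.

```python
import numpy as np

def generate(decode, n_samples, latent_dim=32, threshold=0.5, seed=0):
    """Sample latent vectors from N(0, I) and decode them into binary images."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, latent_dim))  # z ~ p(z)
    probabilities = decode(z)                         # per-pixel values in [0, 1]
    images = (probabilities > threshold).astype(np.uint8)
    return images, z

# Hypothetical stand-in decoder (NOT a trained network): random linear map
# from the 32-d latent space to 64 x 64 pixels, squashed by a sigmoid.
_weights = np.random.default_rng(1).normal(scale=0.2, size=(32, 64 * 64))

def toy_decode(z):
    logits = z @ _weights
    return (1.0 / (1.0 + np.exp(-logits))).reshape(-1, 64, 64)

images, z = generate(toy_decode, n_samples=4)
print(images.shape, z.shape)
```

With a trained model, `toy_decode` would be replaced by the decoder network, and the thresholding step could be dropped to inspect the raw, possibly blurred, decoder output.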

Discussion and conclusions
This study is an early exploration of how modern generative machine learning models such as variational autoencoders could augment our ability to model urban forms. With the ability to extract key urban features from high-dimensional urban imagery, variational autoencoders open new avenues to integrating high-dimensional data streams in urban modelling. The study considered images of street networks, but the proposed methodology could be equally applied to other image data, such as urban satellite imagery.
Variational autoencoders were selected among deep generative models [16,1] due to their two capabilities: firstly, to condense images to low-dimensional representations, and secondly, to generate new, previously unseen images that match the complexity of observed images. The first capability enabled us to extract key urban metrics from street network images; the second gave us the power to generate realistic images of previously unseen urban networks.
Our results, based on 1,059 city images across the globe, showed that VAEs successfully condensed urban images into low-dimensional urban network vectors. This enabled quantitative similarity analysis between urban forms, such as clustering. What is more, VAEs managed to generate new urban forms with complexity matching that of the observed data. Unfortunately, the resolution of the generated images was low, which we attribute to the small size of the dataset. Future work will repeat model training on a much larger corpus of images to improve the generative quality.
Despite the promising results, the study opens essential questions for future work. The first question pertains to the black-box nature of deep learning models that lack comprehensive human interpretability. This limitation is already receiving much attention in the deep learning literature [19,20,13]. In this study, the limitation manifests itself in our lack of understanding of how latent space representations of urban networks relate to established network metrics [17]. A related question refers to the ability to evaluate the quality of model outputs, i.e. latent representations and synthetic images. Again, quality assessment of deep generative models is a hot topic in the broader deep learning research community (see for example [23]). Future work could address the problem from the perspective of urban network science.

Availability of data and materials
All data and program source code described in this article are available to any interested parties. The source code and experiments are available on GitHub at the following URL: https://github.com/kirakowalska/vae-urban-network. The raw data and datasets generated during this study are available upon request.