Positional Encoding Augmented GAN for the Assessment of Wind Flow for Pedestrian Comfort in Urban Areas

by Henrik Høiness, et al.

Approximating wind flows using computational fluid dynamics (CFD) methods can be time-consuming. Creating a tool for interactively designing prototypes while observing the wind flow change requires simpler models that simulate faster. Instead of running numerical approximations resulting in detailed calculations, data-driven methods in deep learning might be able to give similar results in a fraction of the time. This work rephrases the problem from computing 3D flow fields using CFD to a 2D image-to-image translation-based problem on the building footprints to predict the flow field at pedestrian height level. We investigate the use of generative adversarial networks (GAN), such as Pix2Pix [1] and CycleGAN [2], which represent the state of the art for image-to-image translation tasks in various domains, as well as a U-Net autoencoder [3]. The models can learn the underlying distribution of a dataset in a data-driven manner, which we argue can help the model learn the underlying Reynolds-averaged Navier-Stokes (RANS) equations from CFD. We experiment on novel simulated datasets of various three-dimensional bluff-shaped buildings with and without height information. Moreover, we present an extensive qualitative and quantitative evaluation of the generated images for a selection of models and compare their performance with the simulations delivered by CFD. We then show that adding positional data to the input can produce more accurate results by proposing a general framework for injecting such information into the different architectures. Furthermore, we show that the models' performance improves when applying attention mechanisms and spectral normalization to facilitate stable training.




1 Introduction

When designing urban spaces, there is a need to understand the wind environment before the buildings are built to ensure a comfortable environment for the inhabitants. This topic has received significant attention in wind engineering and the scientific literature. Today there are two main methods to analyze the wind environment: experiments in a wind tunnel and simulations using CFD [BLOCKEN201215, JANSSEN2013547, BLOCKEN2009255]. Performing experiments is considered the most accurate method to evaluate the wind environment. CFD simulations are increasingly being accepted as a viable alternative to experiments, thanks to extensive efforts in improving the methodology and validating results against wind tunnel experiments and full-scale measurements. However, even though simulations can be cheaper than performing experiments, they still represent a significant cost.

When designing a new building, it is desirable to see how small changes in the design change the wind flow patterns. Moving a structure closer to its neighbor or adjusting its height would require repeatedly re-running the whole CFD simulation, or multiple simulations from several wind directions. In the early stages of the design process, an architect might accept slightly lower accuracy in the wind predictions if this means a faster design iteration time.

In that case, there might exist less time-consuming methods within the field of machine learning, and in particular, deep learning methods that can potentially allow a more interactive evaluation of the wind environment.

There are several examples of the combination of CFD and deep learning in recent literature. CFDNet [2020cfdnet] introduces a coupled physical simulation and deep learning framework for accelerating RANS simulations by adding a CNN-based stage in between the warmup and refinement stages of a physical solver. In this way, the authors significantly accelerate the convergence of the overall scheme. The model is tested on different geometries unseen during training to evaluate how well the method generalizes. Experiments showed that CFDNet still produced accurate predictions on these geometries. This work indicates that combining physical models with data-driven machine learning models could be a promising approach for accelerating simulations. In addition, [Thuerey_2020] compare the accuracy of physical solvers with surrogate models using a modified version of the U-Net architecture [ronneberger2015unet] trained in a supervised manner. In contrast to CFDNet, their method is an end-to-end surrogate entirely driven by the neural network. They intentionally leave out the physical constraints and instead focus on working with state-of-the-art CNNs and on detailed evaluations. One problem with models like these is that they cannot guarantee that the predictions meet the necessary constraints of the traditional physics-based algorithms. On the other hand, this makes it a more generic approach applicable to various other equations beyond RANS. Similarly, [Bhatnagar_2019] use the signed distance function (SDF) as input to a CNN autoencoder to train a surrogate model for CFD simulations around differently shaped airfoils. The SDF works efficiently with neural networks for shape learning and is widely used in applications such as rendering, segmentation, and extracting structural information of different geometries, essentially providing a universal representation of the different shapes. All the mentioned methods have in common that they have a multi-channel output consisting of velocity and pressure.

In recent years, generative adversarial networks (GANs) [goodfellow2014generative] have been shown to be an efficient unsupervised method for learning the underlying distribution of a given dataset. Different extensions of the original architecture have been proposed. StyleGAN [stylegan], for example, is capable of learning to generate fake human faces that are nearly indistinguishable from real faces. One could imagine that there exists a function that maps any geometrical shape to its corresponding flow field. The surrogate model ffsGAN presented in [supercritical-cfd] tries to do just that. The authors propose a model that leverages the properties of conditional GANs (cGANs) combined with CNNs to directly establish a one-to-one mapping from a given supercritical airfoil to its corresponding flow field structure. Unlike other methods that use an encoder to map the input to a lower-dimensional space, they parameterize the airfoils as a 14-dimensional vector. In comparison, FlowGAN [flowgan] customizes a U-Net generator to include the Reynolds number and angle of attack. The flow parameters are concatenated with the geometry parameters extracted by the encoder of the U-Net before they are passed to an MLP network that performs a nonlinear input-output mapping. The output of the MLP network is then decoded by the generator network. In this way, FlowGAN provides a method for generating solutions to flow fields under various conditions based on observations rather than re-training.

Traditional CFD methods produce high-accuracy results, but they are computationally expensive and do not fit well into the design process of new prototypes in a given domain. Obtaining results often takes several hours or days, depending on the prototype's complexity. Deep learning methods can help create an interactive tool for testing new designs, even when these become computationally hard for physical solvers. The experiments in [Guo2016ConvolutionalNN] demonstrate that CNNs can work as surrogate models for physical solvers on both discrete 2D and 3D bluff shapes. These experiments differ from the other papers mentioned, as they focus more on the interactive application aspect for prototype design of different kinds of bluff shapes. In the 3D domain, [neurips_3d] proposes an architecture based on a residual CNN for CFD prediction using 3D convolutions, enabling the authors to offer designers an interactive tool for prototyping. The dataset they use consists of various geometries representing samples of urban structures. In this way, they can create an interactive tool that could be used by city planners. A unique feature of their tool is that they offer a network trained in reverse, where the user inputs the target wind flow and the network outputs the urban volumes that will produce it. One of their limitations is that only one direction of the wind flow is considered. Usually, multiple directions should be considered to create a representative forecast. Lastly, [regression-cfd] proposes a regression model using Gaussian processes to interactively predict how fluid flows around three-dimensional objects. In general, it is challenging to handle detailed 3D shapes in a data-driven manner using machine learning approaches, as this requires a consistent parameterization of the input and output of the model. To achieve this, the authors propose a PolyCube maps-based parametrization that can be computed at high rates and allows their method to work efficiently even when doing interactive design and optimization during prototyping.

In our work, we formulate the wind flow prediction as an image-to-image translation problem [isola2018imagetoimage], and we explore the potential of the most advanced GAN-based architectures for such a task.

The main contributions of this work are as follows:

  1. We rephrase the problem from computing 3D flow fields using computational fluid dynamics to a 2D image-to-image translation-based problem on the building footprints to predict the flow field at pedestrian height level.

  2. We propose state-of-the-art GAN architectures for the image-to-image translation process. The generator can generate realistic-looking wind flow assessments conditioned on a given geometry input containing a varying number of buildings.

  3. We perform a systematic comparison of the most advanced GANs, experimenting on several new datasets of various bluff-shaped buildings generated using computational fluid dynamics methods. Further, we also perform an experimental study on buildings with different heights and a systematic generalization experiment in which we optimize a model on one of the datasets and investigate how it performs on the others.

  4. We propose a novel extension of known image-to-image translation methods where we inject different positional information into the architectures using the signed distance function and coordinates of the Cartesian space seen by the convolutional filters.

  5. We conduct an ablation study, through experiments, to test the effect of different attention mechanisms for airflow prediction.

  6. We optimize models for a real scenario, using buildings from a built-up city environment. This allows us to analyze the problem at a more applied and complex scale than previous work.

We organize the paper as follows. In section 2 we clarify the problem we are trying to solve while introducing the architectures’ implementation details and their corresponding objectives. In section 3 we introduce the dataset and metrics used for training and evaluating the models. Experimental results and discussion are given in section 4, and the conclusion is drawn in section 5.

2 Methods

This section clarifies the problem we are trying to solve while introducing the network architectures being compared and their corresponding objectives. We then introduce the two kinds of positional information we propose to add and describe the optimization and training details. Lastly, we examine spectral normalization and the attention mechanism.

2.1 Problem formulation

Given a building’s 3D geometry, we simplify this to a 2D image using grayscale to represent building height. With this, we can formalize the CFD prediction task as an image-to-image translation problem as in [isola2018imagetoimage]. Given our two domains $X$ and $Y$, building geometry and CFD flow field respectively, we want to learn the mapping function $G: X \to Y$ between the image pair $(x, y)$, having $x \in X$ and $y \in Y$. This mapping is visualized in Figure 1. We will compare methods performing this translation using both cGANs and autoencoders. We denote our data as $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in X$ and $y_i \in Y$. To capture the conditional mapping between the image pairs, not only between the two domains, the GAN receives the building geometries as a condition for the CFD flow field to be generated, i.e., the output is generated as $G(x)$. As the mapping between geometry and CFD simulation is a one-to-one mapping, we would like our model to be deterministic; therefore, we do not give our generators a random noise vector as proposed in [isola2018imagetoimage].

Figure 1: Mapping from domain $X$ to $Y$.

The generator $G$ is optimized to generate outputs that are indistinguishable from the real simulations, together with an adversarially trained discriminator $D$, which is trained to discriminate between real CFD simulations, $y$, and outputs from the generator, $G(x)$. The adversarial training procedure is shown in Figure 2.

Figure 2: The adversarial training procedure of $G$ and $D$.

We will investigate data-driven CFD prediction by defining three main frameworks based on the most advanced state-of-the-art methods in computer vision: 1) the Pix2Pix architecture [isola2018imagetoimage], 2) CycleGAN [zhu2020unpaired] and 3) a U-Net-based autoencoder [ronneberger2015unet].

2.2 Network architectures

Generative modeling is an unsupervised learning task where the model learns the patterns in the input data to generate outputs that are similar to data from the original dataset it was trained on. Generative adversarial networks are an approach to generative modeling using deep learning methods. What makes GANs different from other generative models is that they learn patterns in datasets in an unsupervised manner by having one part of the model generate the data and another part classify it as real or fake. GANs were first introduced in 2014 by Ian Goodfellow [goodfellow2014generative], and one year later, Alec Radford introduced a more stable version using Deep Convolutional Generative Adversarial Networks (DCGANs) [radford2016unsupervised]. Models like these have advanced from generating low-quality greyscale images of faces to high-resolution 1024 by 1024 images that are nearly impossible to distinguish from real faces.

A GAN (see Figure 3) consists of a generative network $G$ that tries to capture the underlying data distribution of the dataset the network is trained on, and a discriminative model $D$ that estimates the probability of a sample coming from the real distribution rather than being output by $G$. The procedure for training a GAN corresponds to a minimax two-player game. Minimax is a term from game theory and denotes a strategy for making decisions in a game where the players try to minimize the possible loss in a worst-case scenario. In GANs, this principle is used in the network’s training procedure, where the generator’s task is to maximize the probability of $D$ making a mistake. The value function for this game can be defined as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{y \sim p_{data}(y)}[\log D(y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

where $p_{data}$ is the distribution of the real data set, while $G(z)$ is the generated output from $G$ based on the input noise vector $z$. The vector $z$ is randomly drawn from a Gaussian distribution $p_z$ and is what makes $G$ produce different outputs.

Figure 3: A simple illustration of a GAN architecture.
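As a small numeric illustration of the minimax value function described above, the two expectations can be estimated from discriminator outputs on a batch. This is only a sketch: the function name `gan_value` and the probabilities used are hypothetical and not taken from our experiments.

```python
import math

def gan_value(d_real, d_fake):
    """Batch estimate of V(D, G) = E[log D(real)] + E[log(1 - D(fake))].

    d_real: discriminator probabilities assigned to real samples.
    d_fake: discriminator probabilities assigned to generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A confident, correct discriminator pushes the value towards 0 (its maximum).
strong_d = gan_value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])

# A fully fooled discriminator outputs 0.5 everywhere, giving 2 * log(0.5).
fooled_d = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator tries to increase this value, while the generator tries to decrease it by making `d_fake` large.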

2.2.1 Conditional GANs

In an unconditioned generative model, there is no control of what type of data is being generated. A cGAN [mirza2014conditional] can be constructed by simply feeding the data we wish to condition on to both the generator and the discriminator. Such conditioning could be based on class labels or any other auxiliary information. The additional information is combined with the prior noise vector when passed on to the generator and discriminator. This requires some modifications to the loss in Equation 1, where we need to include the condition as we do in Equation 2.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x, y}[\log D(x, y)] + \mathbb{E}_{x, z}[\log(1 - D(x, G(x, z)))] \tag{2}$$

Here $x$ denotes the condition, $y$ a real sample, and $z$ the noise vector.


Image-to-image translation is a graphics problem where the goal is to learn a mapping between an input image and an output image using a dataset built up of pairs of images. This can be learned using cGAN, where the conditional information is the image you want to translate. For the generator to handle this, the image has to be encoded to a one-dimensional vector, as the generator is expecting. To perform the image encoding, it is normal to use a CNN [isola2018imagetoimage], as we will see in the next section.

2.2.2 Pix2Pix

Inspired by cGANs [mirza2014conditional], Pix2Pix [isola2018imagetoimage] uses conditional adversarial networks as a general-purpose solution to image-to-image translation problems, where instead of conditioning on labels, it conditions on an input image and generates a corresponding output image.

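One of the contributions of that work is to demonstrate the use of conditional GANs on a variety of problems; here, we investigate the effectiveness of a Pix2Pix-based architecture for predicting the wind flow. The GAN is built up of a U-Net [ronneberger2015unet] generator and a PatchGAN [isola2018imagetoimage] discriminator. The objective for training the GAN is based on the one presented for cGAN in combination with the L1 distance for regularization:

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G), \qquad \mathcal{L}_{L1}(G) = \mathbb{E}_{x, y, z}[\lVert y - G(x, z) \rVert_1] \tag{3}$$

where $\mathcal{L}_{cGAN}$ is the conditional adversarial objective and $\lambda$ is the constant weight for the L1 distance term.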
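To make the role of the L1 weight concrete, the following sketch computes a generator loss as an adversarial term plus a weighted L1 term on flattened toy images. The names `l1_distance` and `generator_loss`, the toy values, and the non-saturating form of the adversarial term are assumptions of this sketch; the default weight of 100 is the value suggested in the Pix2Pix paper, not necessarily the one used in our experiments.

```python
import math

def l1_distance(target, output):
    """Mean absolute error between two flattened images."""
    return sum(abs(t - o) for t, o in zip(target, output)) / len(target)

def generator_loss(d_fake, target, output, lam=100.0):
    """Adversarial term plus lam times the L1 term for the generator.

    d_fake: discriminator probabilities for the generated image,
            one value per patch.
    The adversarial term uses the non-saturating form -log D(x, G(x)),
    which is an assumption of this sketch.
    """
    adversarial = -sum(math.log(p) for p in d_fake) / len(d_fake)
    return adversarial + lam * l1_distance(target, output)

target = [0.0, 0.5, 1.0, 0.5]   # hypothetical CFD velocity magnitudes
output = [0.1, 0.5, 0.9, 0.5]   # hypothetical generator prediction
loss = generator_loss(d_fake=[0.5, 0.5], target=target, output=output)
```

With a large weight, the L1 term dominates early in training and pulls the prediction towards the simulated field, while the adversarial term sharpens the local structure.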

U-Net [ronneberger2015unet] was first introduced for biomedical image segmentation (see Figure 4). It has an autoencoder structure consisting of an encoder that contracts the input using convolutions and max-pooling and a decoder that expands the encoded output using up-sampling operators. To localize high-resolution features from the input, features from the encoder are combined with the features during the decoder’s up-sampling phase. These are called skip connections, and they essentially concatenate the channels of an encoder layer with those of the corresponding decoder layer. As a result, the decoder is more or less symmetric to the encoder, which yields a u-shaped architecture, hence the name U-Net. A network with skip connections makes sense for image-to-image translation because the task requires a lot of information to flow through the layers, and the skip connections let this information bypass the bottleneck between the encoder and decoder.

Figure 4: U-Net architecture [ronneberger2015unet]

PatchGAN. PatchGAN is used as the discriminator of the network and only penalizes the structure of the images at the scale of patches. The discriminator effectively tries to classify whether each $N \times N$ patch of an image is real or fake, creating an output matrix consisting of probabilities of whether each patch is real or not. The Pix2Pix authors show that $N$ can be much smaller than the image’s size and still produce high-quality results. Smaller patches give fewer parameters, which makes the discriminator run faster, and it can be applied to arbitrarily sized images, making it a more general approach. The discriminator runs these patches convolutionally over the whole image, averaging all the responses to provide the final output of $D$.
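The patch size seen by each output unit follows from the kernel sizes and strides of the discriminator's convolutions. The sketch below computes it for the layer stack of the 70x70 PatchGAN described in the Pix2Pix paper; the helper names, the padding of 1, and the 256-pixel input are assumptions made for illustration and do not describe our exact configuration.

```python
def receptive_field(layers):
    """Input patch size seen by one output unit of a stack of convolutions.

    layers: list of (kernel_size, stride) tuples ordered from input to output.
    """
    size = 1
    for kernel, stride in reversed(layers):
        size = (size - 1) * stride + kernel
    return size

def output_size(image_size, layers, padding=1):
    """Side length of the patch-wise output map for a square input image."""
    size = image_size
    for kernel, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size

# Three stride-2 and two stride-1 convolutions, all with 4x4 kernels.
patchgan = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

patch = receptive_field(patchgan)   # side length N of each classified patch
grid = output_size(256, patchgan)   # side length of the real/fake output map
```

Under these assumptions each output unit judges a 70x70 patch, and a 256x256 input yields a 30x30 map of real/fake decisions.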

2.2.3 CycleGAN

CycleGAN [zhu2020unpaired] presents an alternative way of learning such translations that no longer requires paired training images. The goal is to learn the mapping functions between two domains $X$ and $Y$, given samples from both. What makes CycleGAN different is that it includes two mappings, $G: X \to Y$ and $F: Y \to X$, in comparison to Pix2Pix, which only includes one. It also introduces two discriminators, $D_X$ and $D_Y$, one for each domain. For these models to work together, the loss function includes two terms: the adversarial losses for matching the distribution of the generated images to the real distribution, and a cycle consistency loss that prevents the mappings $G$ and $F$ from contradicting each other.

Architecture. For CycleGAN to translate between the two domains $X$ and $Y$, it uses two generators and two discriminators, one for each translation. The generator is similar to the one in Pix2Pix but appends several residual blocks [he2015deep] between the encoding and decoding blocks. Residual blocks tackle the vanishing/exploding gradient problem, which allows the generator network to be made even deeper. The residual blocks are very similar to the skip connections in U-Net, but instead of being concatenated as new channels before the convolution that combines them, the input is added directly to the convolution’s output. This kind of generator was originally proposed for neural style transfer and super-resolution. As discriminators, the authors use PatchGAN [isola2018imagetoimage] as introduced in Pix2Pix. The full objective for training this architecture is defined as:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{LSGAN}(G, D_Y, X, Y) + \mathcal{L}_{LSGAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F) \tag{4}$$

where $\mathcal{L}_{LSGAN}$ is the least-squares adversarial loss, $\mathcal{L}_{cyc}$ is the cycle consistency loss, and $\lambda$ controls the relative importance of the two objectives.

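Least Squares Adversarial Loss. The adversarial loss described earlier uses the negative log-likelihood as the objective. CycleGAN replaces this with a least-squares loss [mao2017squares], as it has been shown to be more stable during training and to generate higher-quality results. The new adversarial loss function for the network is expressed as:

$$\mathcal{L}_{LSGAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[(D_Y(y) - 1)^2] + \mathbb{E}_{x \sim p_{data}(x)}[D_Y(G(x))^2] \tag{5}$$

The discriminator $D_Y$ is trained to minimize this expression, while $G$ is trained to minimize $\mathbb{E}_{x \sim p_{data}(x)}[(D_Y(G(x)) - 1)^2]$.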

Figure 5: Consistency loss in CycleGAN [zhu2020unpaired]

Cycle Consistency Loss. Adversarial losses alone cannot guarantee that the learned function can map an individual input $x_i$ to a desired output $y_i$. The CycleGAN authors argue that the learned mapping functions should be cycle-consistent, as shown in Figure 5. The image translation cycle, $x \to G(x) \to F(G(x)) \approx x$, from Figure 5(b), shows that the cycle should be able to bring $x$ back and output a result as similar as possible to the original $x$. This is what we call forward cycle consistency, while the cycle going from $y$ is called backward cycle consistency. CycleGAN includes both of these, as the model consists of two generators. The cycle consistency loss uses the L1 loss as defined here:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\lVert F(G(x)) - x \rVert_1] + \mathbb{E}_{y \sim p_{data}(y)}[\lVert G(F(y)) - y \rVert_1] \tag{6}$$
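The forward and backward cycle terms can be illustrated with toy mappings on flattened images. In this sketch the generators are simple hand-written functions (`G`, `F_exact`, `F_biased`), which are hypothetical stand-ins for trained networks.

```python
def l1(a, b):
    """Mean absolute difference between two flattened images."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)

def cycle_consistency_loss(G, F, xs, ys):
    """Forward cycle x -> G(x) -> F(G(x)) plus backward cycle y -> F(y) -> G(F(y))."""
    forward = sum(l1(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward

# Toy "generators": G doubles every pixel, F_exact inverts G exactly,
# F_biased inverts it with a constant offset.
def G(x):
    return [2.0 * v for v in x]

def F_exact(y):
    return [0.5 * v for v in y]

def F_biased(y):
    return [0.5 * v + 0.1 for v in y]

xs = [[0.0, 1.0], [0.5, 0.25]]   # hypothetical samples from domain X
ys = [[1.0, 2.0]]                # hypothetical samples from domain Y

consistent = cycle_consistency_loss(G, F_exact, xs, ys)
inconsistent = cycle_consistency_loss(G, F_biased, xs, ys)
```

The loss is zero only when the two mappings invert each other, and it grows as soon as either cycle fails to return to its starting image.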


2.3 Positional information

We propose to augment the architectures above by injecting positional information in the form of extra input channels. In more detail, as the filters in convolutions are translation-equivariant, we investigate whether adding positional embeddings with regard to the buildings affects the methods’ performance. Below we define the different kinds of positional information that will be used in our experiments.

2.3.1 SDF - Signed Distance Function

SDF is widely used for rendering and segmentation and works efficiently with neural networks for shape learning [Bhatnagar_2019]. [Guo2016ConvolutionalNN] reports the effectiveness of SDF in representing the geometry shapes for CNNs.

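The signed distance function of a set of points $X$ from the boundary of a set of objects $\Omega$ can be formulated mathematically as:

$$SDF(x) = \begin{cases} d(x, \partial\Omega) & \text{if } x \notin \Omega \\ 0 & \text{if } x \in \partial\Omega \\ -d(x, \partial\Omega) & \text{if } x \in \Omega \end{cases} \tag{7}$$

where $\Omega$ denotes an object, and $d(x, \partial\Omega) = \min_{x_I \in \partial\Omega} \lVert x - x_I \rVert$ measures the shortest distance of each given point $x \in X$ from the boundary points of the objects. The SDF provides a measure of whether a point is inside or outside of an object, and how close it is to the closest object. Figure 6 illustrates the contour plot of the SDF for a two-building geometry (see the dataset definitions in subsection 3.1). A visualization of the implemented SDF layer can be found in Figure 7.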

Figure 6: A signed distance function plot for a two-building geometry image sample. The magnitude of the values in the plot represents the shortest distance to any of the two buildings.
Figure 7: An SDF layer adds an additional layer to our models’ input. This additional information contains the SDF values defined by Equation 7.
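A minimal sketch of how such an SDF channel could be computed from a binary building mask is shown below. It is a brute-force version intended only for small grids; the sign convention (positive outside, negative inside, zero on the boundary), the pixel-based boundary definition, and the name `signed_distance` are assumptions of this sketch rather than a description of our implementation.

```python
import math

def signed_distance(mask):
    """Brute-force SDF of a binary mask (1 = building, 0 = free space).

    Boundary pixels are building pixels with at least one free (or
    out-of-image) 4-neighbour. Distances are positive outside buildings,
    negative inside and zero on the boundary. The mask must contain at
    least one building pixel.
    """
    h, w = len(mask), len(mask[0])

    def is_boundary(i, j):
        if not mask[i][j]:
            return False
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if not (0 <= ni < h and 0 <= nj < w) or not mask[ni][nj]:
                return True
        return False

    boundary = [(i, j) for i in range(h) for j in range(w) if is_boundary(i, j)]
    sdf = []
    for i in range(h):
        row = []
        for j in range(w):
            d = min(math.hypot(i - bi, j - bj) for bi, bj in boundary)
            row.append(-d if mask[i][j] else d)
        sdf.append(row)
    return sdf

# 5x5 toy geometry with a 3x3 building in the centre.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
sdf = signed_distance(mask)
```

The resulting grid has the same size as the geometry image, so it can be concatenated to the model input as one extra channel. For realistic image sizes, a Euclidean distance transform from an image-processing library would be the more practical choice.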

2.3.2 CoordConv

The second kind of positional information relates to the coordinates of the Cartesian space.

Convolutions are widely used in modern deep learning architectures. One of their strengths is translation equivariance: regardless of where a feature is present in an image, the same filter can be applied. [liu2018coordconv] proposes an extension of vanilla convolutions that allows filters to know where they are in an image. This is achieved by adding two additional channels that contain the coordinates of the Cartesian space seen by the convolutional filters. This extension is visualized in Figure 8. More precisely, the i coordinate channel is an $h \times w$ rank-1 matrix with its first row filled with 0’s, its second row with 1’s, its third with 2’s, etc. The j coordinate channel is similar, but with columns filled with constant values instead of rows [liu2018coordconv]. The values are then normalized before the channels are concatenated. [liu2018coordconv] propose using the CoordConv architecture for GANs by replacing the first convolutional layer in both the generator and discriminator. Similarly to their approach, we propose to add those two channels as input to the image-to-image translation frameworks. The results of these experiments are illustrated in subsection 4.3.

Figure 8: A CoordConv layer has the same functional signature as a vanilla convolution, but accomplishes the positional mapping by concatenating two extra channels to the incoming representation. These channels contain hard-coded coordinates, the most basic version of which is one channel for the i coordinate and one for the j coordinate.
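The two coordinate channels described above can be built as follows. This is a sketch on plain nested lists; the helper names `coord_channels` and `add_coords` are hypothetical, and the linear scaling to [-1, 1] follows the normalization described in the CoordConv paper.

```python
def coord_channels(height, width):
    """Return (i_channel, j_channel), each scaled linearly to [-1, 1]."""
    def scale(k, n):
        return 2.0 * k / (n - 1) - 1.0 if n > 1 else 0.0

    # i channel: constant along each row; j channel: constant along each column.
    i_channel = [[scale(i, height) for _ in range(width)] for i in range(height)]
    j_channel = [[scale(j, width) for j in range(width)] for _ in range(height)]
    return i_channel, j_channel

def add_coords(image_channels):
    """Append the two coordinate channels to a list of H x W channels."""
    h, w = len(image_channels[0]), len(image_channels[0][0])
    i_channel, j_channel = coord_channels(h, w)
    return image_channels + [i_channel, j_channel]

# One hypothetical 3x3 geometry channel.
geometry = [[[0, 1, 0],
             [0, 1, 0],
             [0, 0, 0]]]
augmented = add_coords(geometry)
```

The augmented input has two more channels than the original one, so only the first convolution of the network needs to accept the additional channels.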

2.4 Spectral Normalization

A persisting challenge in the training of GANs is controlling the performance of the discriminator [miyato2018spectral]. One of the most significant difficulties is the lack of stability when updating the generator’s and discriminator’s weights. Spectral normalization is a weight normalization technique used to stabilize the discriminator’s training. It is computationally light and easy to incorporate into existing GAN architectures. Compared to other regularization techniques like weight normalization, weight clipping, and gradient penalty, spectral normalization has been shown to work better. Using spectral normalization, one controls the Lipschitz constant of the discriminator, i.e., the maximum absolute value of its derivatives. It does this by normalizing each network layer’s weight matrix $W$ with its spectral norm $\sigma(W)$. By doing this, the Lipschitz constant of the discriminator is bounded by one, as shown in Equation 8.

$$\bar{W}_{SN}(W) = \frac{W}{\sigma(W)}, \qquad \sigma(\bar{W}_{SN}(W)) = 1 \;\Rightarrow\; \lVert D \rVert_{Lip} \le 1 \tag{8}$$

A Lipschitz constant of one means that the maximum absolute value of the derivative is at most one. Giving this property to the discriminator makes it more stable during the training of the whole GAN. By constraining the derivative, one makes sure that the discriminator does not learn too fast compared to the generator. This avoids manually finding a proper balance for updating the two adversarial networks and essentially facilitates better feedback from the discriminator to the generator.
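The normalization of a single weight matrix can be sketched with power iteration, which estimates the largest singular value without a full decomposition. The helper names and the toy matrix are hypothetical, and the 50 iterations are chosen only to make this standalone example converge; practical implementations typically run very few iterations per training step and reuse the vectors between steps.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def spectral_norm(W, iterations=50):
    """Estimate the largest singular value of W by power iteration."""
    Wt = transpose(W)
    v = normalize([1.0] * len(W[0]))
    for _ in range(iterations):
        u = normalize(matvec(W, v))    # left singular vector estimate
        v = normalize(matvec(Wt, u))   # right singular vector estimate
    return sum(a * b for a, b in zip(u, matvec(W, v)))

def spectrally_normalize(W, iterations=50):
    """Divide W by its spectral norm so the result has spectral norm 1."""
    sigma = spectral_norm(W, iterations)
    return [[w / sigma for w in row] for row in W]

W = [[3.0, 0.0],
     [0.0, 1.0]]                 # hypothetical layer weights, sigma(W) = 3
W_sn = spectrally_normalize(W)
```

After normalization, the layer can stretch any input vector by at most a factor of one, which is what bounds the layer's contribution to the discriminator's Lipschitz constant.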

2.5 Attention

In the context of neural networks, attention is a technique that imitates how cognitive attention works, which is the process of concentrating on specific parts of information while ignoring less essential elements. We differentiate between soft and hard attention [show-and-tell-attention], a distinction that is necessary to understand when optimizing attention-based neural networks. When training a neural network, one wants the model to be smooth and differentiable. Creating a differentiable model requires the weights of the attention layer to be distributed over all areas of an image, in contrast to selecting only one patch of the image. Soft attention assigns weights to all parts of the picture. Hard attention, on the other hand, only selects one patch at a time and can only be trained using methods like reinforcement learning, as it is non-differentiable. Using soft attention, the network can essentially filter out the less essential parts of an image and focus its predictions on achieving detailed results in the more critical areas.

This paper focuses on different forms of soft attention used in CNNs to better attend to essential parts of the image-to-image translation process.

2.5.1 Self-attention

The self-attention mechanism relates different positions of the input to each other to compute an attentive representation of that same input. [zhang2019selfattention] proposes SAGAN, a GAN architecture that introduces a self-attention mechanism into convolutional GANs. The self-attention module is complementary to convolutions and helps with modeling long-range, multi-level dependencies across image regions. The self-attention feature maps are calculated from the image features of the previous layer. The feature maps are further multiplied with a learnable scale parameter. This learnable parameter allows the model to gradually learn to assign more weight to the attention maps. These maps are then added back to the initial input feature maps to filter which sections are more important to attend to.
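The interplay between the attention maps and the learnable scale parameter can be sketched on a handful of flattened spatial positions. In this simplified version the queries, keys, and values are the input features themselves, whereas SAGAN uses learned 1x1 convolutions for these projections; the function names and toy features are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features, gamma):
    """Self-attention over N positions with a residual connection.

    features: list of N feature vectors, one per spatial position.
    gamma: scale parameter weighting the attended features.
    Queries, keys and values are the features themselves here, which is
    a simplification of this sketch.
    """
    out = []
    for q in features:
        # Attention weights of this position over all positions.
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in features])
        attended = [sum(w * v[c] for w, v in zip(weights, features))
                    for c in range(len(q))]
        # Scaled attention output added back to the input features.
        out.append([gamma * a + x for a, x in zip(attended, q)])
    return out

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = self_attention(features, gamma=0.0)   # attention switched off
mixed = self_attention(features, gamma=0.5)      # attention partly blended in
```

With the scale parameter at zero the module passes its input through unchanged, so the network can start from purely local convolutional features and blend in long-range information as the parameter grows.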

2.5.2 Convolutional Block Attention Module

As we investigate the effects of attention for our problem, we also looked at another attention module, the Convolutional Block Attention Module (CBAM) proposed in [woo2018cbam] for feed-forward convolutional neural networks. Given an intermediate feature map, the attention module sequentially infers attention maps along two separate dimensions, channel and spatial. The attention maps ($M_c$ and $M_s$) are then multiplied with the input feature map ($F$) for adaptive feature refinement, summarized as:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F' \tag{9}$$

where $\otimes$ denotes element-wise multiplication and $F''$ is the final refined output.
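The sequential channel-then-spatial refinement can be sketched on a tiny feature map. This is a strongly simplified illustration: the shared MLP of the channel branch and the convolution of the spatial branch in CBAM are omitted, so only the pooling, the sigmoid, and the two broadcasted multiplications remain. All names and values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(F):
    """One weight per channel from average- and max-pooled descriptors.

    The shared MLP used in CBAM is omitted in this sketch.
    """
    weights = []
    for channel in F:
        flat = [v for row in channel for v in row]
        weights.append(sigmoid(sum(flat) / len(flat) + max(flat)))
    return weights

def spatial_attention(F):
    """One weight per pixel from the channel-wise average and maximum.

    The convolution applied to the pooled maps in CBAM is omitted here.
    """
    h, w = len(F[0]), len(F[0][0])
    return [[sigmoid(sum(ch[i][j] for ch in F) / len(F)
                     + max(ch[i][j] for ch in F))
             for j in range(w)] for i in range(h)]

def cbam(F):
    """Apply channel attention first, then spatial attention."""
    Mc = channel_attention(F)
    F1 = [[[Mc[c] * v for v in row] for row in ch] for c, ch in enumerate(F)]
    Ms = spatial_attention(F1)
    return [[[Ms[i][j] * v for j, v in enumerate(row)]
             for i, row in enumerate(ch)] for ch in F1]

# Hypothetical feature map with 2 channels of size 2x2.
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 1.0]]]
refined = cbam(F)
```

The refined map keeps the shape of the input, with each value scaled down according to how important its channel and its spatial location are judged to be.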

3 Experimental Setup

In this section, we will present our experiments. First, we will describe our dataset and how it was created. Second, we will give our experimental plan and evaluation metrics used to measure the effectiveness of our proposed framework for wind flow prediction. Finally, we define all optimization and training details.

3.1 Datasets

To explore our proposed prediction architecture’s generality for CFD airflow simulations, we test the method on various datasets with different complexities. The datasets consist of image pairs of building geometries and CFD simulations. The 3D problem of simulating flow fields for three-dimensional buildings is translated to a 2D problem, where we see the buildings from above. More specifically, the CFD simulation images show the magnitude of each cell’s velocity vector in the flow field. Each dataset can then be formalized as a set of image pairs $\{(x_i, y_i)\}_{i=1}^{N}$, having $x_i \in X$ and $y_i \in Y$. To simplify the problem, we bucketize the CFD result into discrete velocity values. The following list details all datasets used in our experiments:

  • Wall: The dataset contains geometries of walls with their respective CFD simulations. The geometries have different center offsets, in addition to an angle in relation to the wind inlet direction.

  • Single building: The dataset contains geometries of single buildings with a fixed height. The geometries have the same parameterization as the wall dataset, in addition to varying length and width.

  • Two buildings: The dataset contains geometries of two buildings with a fixed height. The buildings in each geometry have different center offsets while having the same angle to the wind direction. This positioning is chosen because nearby buildings are often placed symmetrically. The height and width also vary between the buildings in the same geometry.

  • Two buildings with varying height - : The dataset contains geometries with the same parameterizations as , except that the height of the buildings is provided as an additional channel. As we now have two scales, height and magnitude of the airflow velocity, we need to distinguish between them. Therefore, we input the geometries with an additional channel for building height and use a single-channel output containing only the velocity magnitude, which is the desired target. The input and output dimensions are and , respectively. See Figure 10 for a visualization of the construction of the model input.

  • Real urban city environment - : The dataset contains 287 geometries from the city center of Oslo, Norway. Compared to the geometries in , these geometries are more complex in shape and allow single buildings to have multiple height values. The dataset was generated by performing simulations on 600 patches of Oslo and using a 300 cropped centered circle for the training data. The resulting simulations’ flow fields have a diameter of 300 meters, with a maximum building height of 130 meters, and contain wind velocities up to 15 m/s. They represent actual buildings from the more urban and built-up areas of Norway. The flow field is encapsulated as a circle to allow rotation of geometries and calculation of comfort maps. The geometries are represented equivalently to , having the same input and output dimensions.
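The bucketizing step mentioned before the dataset list can be illustrated with a minimal NumPy sketch. The equally spaced bins, the velocity range, and the function name `bucketize_velocity` are assumptions made for illustration; the text does not state how the bins were chosen.

```python
import numpy as np

def bucketize_velocity(magnitude, n_bins, v_max):
    """Quantize a velocity-magnitude field into n_bins equally wide buckets.

    Returns the bucket index of every cell and the bucket-centre velocity.
    Equal-width bins on [0, v_max] are an assumption, not the paper's recipe.
    """
    magnitude = np.clip(np.asarray(magnitude, dtype=float), 0.0, v_max)
    edges = np.linspace(0.0, v_max, n_bins + 1)
    # Digitizing against the interior edges yields indices 0 .. n_bins - 1.
    idx = np.digitize(magnitude, edges[1:-1])
    centres = 0.5 * (edges[:-1] + edges[1:])
    return idx, centres[idx]

# Toy 2x2 field in m/s, quantized into 5 buckets of width 3 m/s.
field = np.array([[0.0, 1.2], [7.4, 15.0]])
idx, quantized = bucketize_velocity(field, n_bins=5, v_max=15.0)
```

With such a quantization, a network that predicts continuous values can land between two bucket centres, which is relevant when comparing models trained with and without an adversarial loss.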

Examples from each dataset are visualized in Figure 9.

Figure 9: Example image pair samples from each dataset .
Figure 10: Construction of model input from .

3.2 Generation of training data

The training data for the learning is generated using CFD simulations. The simulations for the single building and two buildings are performed in the commercial CFD software Simcenter STAR-CCM+ from Siemens PLM Software. The urban city environment is simulated using the open-source software OpenFOAM v7.

The chosen model solves the incompressible, three-dimensional, steady Navier-Stokes equations governing fluid flow, using the finite volume method on an unstructured grid. An example of the computational grid used for the simulations is shown in Figure 11. The simulation setup is based on best practice guidelines for CFD simulations of urban flows [franke2007cost, tominaga2008aij]. The turbulence model used is the realizable k-epsilon model.

For the simulations of single and dual buildings, a geometry model with one or two buildings is automatically generated with varying dimensions, origin and orientation. The full 3D velocity field is then obtained from the CFD simulation, and the velocity magnitude in a slice above the ground is extracted for the training data. For the urban city environment, more information on the OpenFOAM simulation setup can be found in [hagbo].
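As a rough sketch of the extraction step, the snippet below computes the velocity magnitude on one horizontal plane of a 3D velocity field. It assumes the CFD result has already been resampled onto a regular grid of shape (nz, ny, nx); the function name and the grid layout are illustrative assumptions, since the actual export from the solver is not described here.

```python
import numpy as np

def velocity_magnitude_slice(u, v, w, z_index):
    """Return |U| = sqrt(u^2 + v^2 + w^2) on the horizontal plane z_index.

    u, v, w: arrays of shape (nz, ny, nx) with the velocity components,
    assumed to be resampled from the unstructured mesh onto a regular grid.
    """
    u, v, w = (np.asarray(c, dtype=float) for c in (u, v, w))
    return np.sqrt(u[z_index] ** 2 + v[z_index] ** 2 + w[z_index] ** 2)

# Uniform toy field with components (3, 4, 0), so the magnitude is 5 everywhere.
shape = (4, 3, 3)
u = np.full(shape, 3.0)
v = np.full(shape, 4.0)
w = np.zeros(shape)
plane = velocity_magnitude_slice(u, v, w, z_index=1)
```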

Figure 11: Example of 3D mesh around a building used for the flow simulation.

3.3 Experimental plan

To investigate the models’ generality for CFD prediction, we evaluate the method on various datasets with different complexity, listed above. All experiments are performed using Pix2Pix, CycleGAN, and U-Net, training for a total of 70 epochs using of .

3.3.1 Experiments

We perform experiments on all datasets, , separately. Qualitative results are shown in Figures [12, 13, 15, 16, 17], while quantitative measurements are listed in Tables [1, 2, 3, 4, 5, 6, 7, 8]. Training times for the proposed method are listed in Table 2. On all our datasets, training can be very fast. For example, the results shown in Figure 12 took less than 2 hours of training per model on an Nvidia Tesla V100 32 GB. At inference time, the models perform a forward pass in well under a second.

3.3.2 Investigating generalization

To investigate how well the models generalize the mapping from building geometry to the CFD flow field, we train on more complex data and evaluate on simpler data, and vice versa. We chose this design because we want to examine whether a model can detect multiple buildings even though it has only been trained on single buildings; occlusion of buildings is one scenario of interest here. Conversely, we want to see whether a model trained on two buildings generalizes the mapping well enough to predict the airflow around single buildings. We have executed the following two experiments:

(a) Optimizing model on single buildings (), evaluating on two buildings (). This will allow us to see if the model is able to generalize the domain transfer function to a more complex scenario.

(b) Training on two buildings (), evaluating on single building (). This experiment will measure how well the method can generalize to a simpler task, a single building CFD simulation.

3.3.3 Stabilizing GANs

As mentioned in subsection 2.4, one of the most significant challenges of training GANs is the lack of stability when updating the generator and discriminator’s weights. Spectral normalization is used to improve this stability: by constraining the Lipschitz constant of the discriminator to be at most one, the training process should become more stable. We therefore explore what effect spectral normalization has on Pix2Pix and investigate the consequences of keeping the generator and discriminator at a similar skill level throughout training.

Spectral normalization is applied to each layer of the discriminator, and the implementation details are as thoroughly described in [miyato2018spectral].
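To make the idea concrete, the following NumPy sketch estimates the largest singular value of a weight matrix by power iteration and divides the weights by it, which is the core operation of spectral normalization. It is a simplified illustration and not the training-time implementation: the function name, the fixed iteration count, and the random initial vector are assumptions, and practical implementations typically run a single power-iteration step per update while reusing the vector between steps.

```python
import numpy as np

def spectral_normalize(weight, n_iter=50, seed=0):
    """Divide a weight tensor by an estimate of its largest singular value."""
    w = np.asarray(weight, dtype=float)
    w2d = w.reshape(w.shape[0], -1)      # flatten conv kernels to a matrix
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(w2d.shape[0])
    for _ in range(n_iter):              # power iteration on W W^T
        v = w2d.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = w2d @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = float(u @ w2d @ v)           # estimated spectral norm
    return w / sigma, sigma

# Toy weight matrix whose largest singular value is 3.
w = np.diag([3.0, 1.0, 0.5])
w_sn, sigma = spectral_normalize(w)
```

After normalization the layer's spectral norm is approximately one, so a stack of such layers with 1-Lipschitz activations has a Lipschitz constant of at most one.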

3.3.4 SDF and CoordConv

In subsection 2.3 we propose to augment the introduced architectures by injecting positional information through extra channels. This information could help the network determine what parts of the input are essential to predict more accurately. Therefore, we would like to perform quantitative analysis to explore the effect of SDF and CoordConv on three different neural networks - Pix2Pix, CycleGAN, and U-Net.
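A minimal sketch of how such extra channels could be built from a binary footprint mask is given below. The sign convention of the SDF (negative inside a building), the normalization to [-1, 1], the coordinate range, and all function names are assumptions for illustration; the brute-force distance computation is only meant for small grids and requires the mask to contain both building and non-building pixels.

```python
import numpy as np

def coord_channels(height, width):
    """Two channels with the row and column coordinate scaled to [-1, 1]."""
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy, xx])

def signed_distance(mask):
    """Brute-force signed distance field (in pixels) of a binary footprint.

    Negative inside the building, positive outside (assumed convention).
    """
    mask = np.asarray(mask, dtype=bool)
    flat = mask.ravel()
    pts = np.argwhere(np.ones_like(mask)).astype(float)  # all pixel coords
    inside, outside = pts[flat], pts[~flat]

    def nearest(src, dst):
        d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
        return d.min(axis=1)

    sdf = np.zeros(mask.size)
    sdf[~flat] = nearest(outside, inside)
    sdf[flat] = -nearest(inside, outside)
    return sdf.reshape(mask.shape)

def add_positional_channels(mask):
    """Stack footprint, normalized SDF, and coordinate channels: (4, H, W)."""
    sdf = signed_distance(mask)
    sdf = sdf / np.abs(sdf).max()
    coords = coord_channels(*mask.shape)
    return np.concatenate([np.asarray(mask, float)[None], sdf[None], coords])

# Toy 5x5 footprint with a single building pixel in the centre.
mask = np.zeros((5, 5), dtype=bool)
mask[2, 2] = True
x = add_positional_channels(mask)
```

The resulting tensor can be fed to any of the three architectures by widening the first convolution to accept four input channels.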

3.3.5 Attention

If a model attends to the more critical parts of an input image, the attention mechanism encourages it to focus on the details in these specific regions. For our problem, essential areas could be the wake area immediately behind a building, or turbulent flows around building corners. Based on this, we want to explore self-attention and CBAM, two attention mechanisms, as described in subsection 2.5. The experiments form an ablation study of the attention mechanisms in Pix2Pix’s generator and discriminator.

The implementation details for self-attention and CBAM are described in [zhang2019selfattention, woo2018cbam]. For both attention mechanisms, attention is implemented in the deconvolutional blocks of the U-Net generator and is not present during the downsampling. The discriminator, on the other hand, only applies attention in the 2nd and 3rd convolutional blocks of the network.
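For intuition, the sketch below shows a bare-bones self-attention operation over the spatial positions of a feature map, written in NumPy. It is not the configuration used in the experiments: the projection matrices `w_q`, `w_k`, `w_v`, the scalar `gamma`, and the channel sizes are illustrative assumptions, and the 1x1 convolutions are written as plain matrix products.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v, gamma):
    """Self-attention over the spatial positions of a feature map x (C, H, W).

    w_q, w_k: (C_reduced, C) projections, w_v: (C, C) projection.
    gamma scales the attended features before the residual connection.
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w)                      # (C, N) with N = H * W
    q, k, v = w_q @ flat, w_k @ flat, w_v @ flat    # 1x1 convs as matmuls
    logits = q.T @ k                                # (N, N) position similarity
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)         # every row sums to one
    out = v @ attn.T                                # weighted sum of values
    return (gamma * out + flat).reshape(c, h, w), attn

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 3))
w_q = rng.standard_normal((2, 4))
w_k = rng.standard_normal((2, 4))
w_v = rng.standard_normal((4, 4))
y, attn = self_attention(x, w_q, w_k, w_v, gamma=0.0)
```

With `gamma` set to zero the block is an identity mapping, so a network can start from its non-attentive behaviour and learn how much attended context to mix in.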

3.3.6 Pedestrian comfort in urban areas

Compared to real case scenarios from an urban city environment, the datasets - could be considered simple. Hence, as our last experiment, we would like to explore how our models perform on a much more complex dataset, . The dataset is, as mentioned earlier, generated from parts of the most built-up areas in Oslo. Therefore, it provides a more realistic scenario for designing an interactive tool for prototyping and city planning.

3.4 Evaluation Details

For evaluations done on dataset , we use a test set containing 20% of the images for the final assessment. We include a set of metrics for evaluating our models’ predictions against the physics solver solution. We denote , and represents the ith pixel intensity in the target simulation and the predicted flow field.

MAE The MAE is calculated for all predicted images produced by the models. The metric is widely used to quantify the difference between predicted and true values in accuracy validation for a model. The lower the MAE score, the better the model is at recreating the corresponding flow field, given a building geometry. The metric is used in [supercritical-cfd] for evaluating CFD airflow predictions. The metric can be formalized as


RMSE We calculate the RMSE of all predicted values against the simulated flow fields. This score gives us a squared-error measure of how well our model can recreate each pixel value. RMSE has the benefit of penalizing large errors more than smaller errors. For example, a residual of 10 contributes four times as much to the squared error as a residual of 5. The metric is widely used, for example to quantify CFD prediction errors in [wang2020physicsinformed]. We can mathematically formalize this metric as


MRE This metric allows us to quantify the relative error of velocity magnitude between a model’s predictions and the physics solver solution of the flow field. The metric is scale- and range-invariant. Therefore, it can be seen as a better indicator of the quality of a prediction. The metric is commonly used to quantify relative residuals in accuracy validation; it is used by [2020cfdnet, Guo2016ConvolutionalNN, Thuerey_2020, flowgan, Bhatnagar_2019] and can be found in the CFD literature [cfd-literature]. MRE is defined as


These three metrics are the evaluation metrics we apply to our predicted wind flows. To further analyze where our predictions have the most significant errors, we take random samples from the test set and investigate the absolute pixel difference between the simulations and predictions.
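The three metrics and the absolute difference map can be sketched in NumPy as follows. MAE and RMSE follow their standard definitions. The normalization used for MRE here (summed absolute residual divided by summed target magnitude) is an assumption, since several variants exist in the literature; the function names and the small `eps` guard are likewise illustrative.

```python
import numpy as np

def mae(target, pred):
    """Mean absolute error over all pixels."""
    return float(np.mean(np.abs(target - pred)))

def rmse(target, pred):
    """Root mean squared error over all pixels."""
    return float(np.sqrt(np.mean((target - pred) ** 2)))

def mre(target, pred, eps=1e-8):
    """Relative error: summed |residual| over summed |target| (assumed form)."""
    return float(np.sum(np.abs(target - pred)) / (np.sum(np.abs(target)) + eps))

def abs_diff_map(target, pred):
    """Per-pixel absolute difference, useful for locating the largest errors."""
    return np.abs(target - pred)

# Toy example: one of four pixels is off by 2.
target = np.array([[1.0, 2.0], [3.0, 4.0]])
pred = np.array([[1.0, 2.0], [3.0, 2.0]])
scores = (mae(target, pred), rmse(target, pred), mre(target, pred))
```

In this toy case the single residual of 2 gives an MAE of 0.5 but an RMSE of 1.0, which illustrates how RMSE weights large errors more heavily.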

3.4.1 Models

We compare three state-of-the-art models for image-to-image translation on CFD airflow prediction.

  1. Pix2Pix [isola2018imagetoimage]. For each iteration, we update both the generator and the discriminator. The generator has a U-Net architecture [ronneberger2015unet], with downsamplings in the U-Net, resulting in an output resolution of . For our problem, the input and output might differ in surface appearance, but both consist of the same underlying structure. Therefore, we may say that their structure is roughly aligned. The generator is designed around these considerations, having skip connections between each down- and up-sampling layer. This way, we circumvent the bottleneck layer; low-level information is passed through the skip connections. Each skip connection concatenates all channels at layer with those at layer , having as the total number of layers. We use the leaky rectified linear unit (LeakyReLU) and instance normalization in each U-Net downsample block, and the rectified linear unit (ReLU) in each upsample block. Instance normalization has been demonstrated to be effective at image generation tasks [ulyanov2017instance]. To reduce the chance of overfitting, the implementation also uses dropout [stava2014dropout]. The generator itself has parameters. The discriminator consists of a PatchGAN, which tries to classify each patch in the input image as real or fake. It consists of 5 convolutional layers, interleaved with leaky ReLU activations and instance normalization. This discriminator is applied convolutionally across the image, averaging all responses to provide the final output of . The discriminator has trainable parameters, giving us a total of parameters for the proposed network architecture.

  2. CycleGAN [zhu2020unpaired] consists of two generators with nine residual blocks and two deconvolutional layers, interleaved with ReLU, dropout, and instance normalization. The model has two discriminators, each with five convolutional layers with leaky ReLU and instance normalization. Each generator has a total of parameters, and each discriminator has parameters, which yields a total of parameters. The model is optimized using subsubsection 2.2.3 and subsubsection 2.2.3. Details about architecture and optimization are given in subsubsection 2.2.3.

  3. U-Net [ronneberger2015unet]: An autoencoder-like architecture for image translation. The architecture has the same specifications as described for the U-Net generator in model (1). The autoencoder has a total of trainable parameters, as it is identical to the generator of the Pix2Pix model. It is optimized using L1 distance to the target simulation.

The U-Net architecture used in models (1) and (3) is described in detail in subsubsection 2.2.2.

3.5 Optimization and training details

All models are optimized with the Adam solver [kingma2017adam] with a batch size of 1, a learning rate of , and the momentum parameters , . We keep the same learning rate for the first 50 epochs and linearly decay the rate to zero over the next 20 epochs.
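The learning rate schedule described above can be written as a small function. This is a sketch under the assumption that epochs are 0-indexed and that the rate reaches exactly zero after epoch 70; `base_lr` is a placeholder argument, as the actual value is not reproduced here.

```python
def learning_rate(epoch, base_lr, n_constant=50, n_decay=20):
    """Learning rate at the start of a 0-indexed epoch.

    Constant for the first n_constant epochs, then decayed linearly so that
    it reaches zero once n_constant + n_decay epochs have been completed.
    """
    if epoch < n_constant:
        return base_lr
    remaining = n_constant + n_decay - epoch
    return base_lr * max(remaining, 0) / n_decay

# Relative schedule over the 70 training epochs for a unit base rate.
schedule = [learning_rate(e, base_lr=1.0) for e in range(70)]
```

In a deep learning framework the same function can be passed as a multiplicative factor to a lambda-based scheduler by calling it with `base_lr=1.0`.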

While optimizing our networks, we update both the generator and discriminator for each training iteration. The Pix2Pix model is optimized end to end with respect to the objective in Equation 3 with , CycleGAN with respect to subsubsection 2.2.3 with , and U-Net with the L1 loss. With CycleGAN, we also use a pool size of 50. As suggested in [isola2018imagetoimage], we divide the discriminator loss in half to slow down the rate at which the discriminator learns relative to G.

All models are trained on an Nvidia Tesla V100 GPU 32 GB using NTNU’s computing cluster IDUN [idun]. See Table 2 for each baseline’s training time.

At inference time, we use the generator, in evaluation mode, without dropout and instance normalization, as opposed to [isola2018imagetoimage]. For our given problem, we want the model to be deterministic concerning the conditional output.

4 Results & Discussion

This section will present our results and discussion. Firstly, we will analyze the overall results for the different architectures on the datasets. Then, we will explore the impact of adding spectral normalization, attention and additional positional information to the input data. Finally, we evaluate our models on a more complex dataset generated from actual buildings in the city of Oslo and discuss how our networks would work in an interactive tool for wind flow predictions.

4.1 Neural networks for wind flow prediction

Table 1 compares Pix2Pix, CycleGAN and U-Net in terms of MAE, RMSE and MRE. We visualize randomly selected predictions from the test set of in Figure 12. For more samples from , see Figures 21, 22 and 23.

Figure 12: Test sample, from , with predictions from all models.

To quantify our findings, we have listed all metrics for all models, evaluated on , in Table 1. We see that for all datasets, U-Net yields lower residuals than its competitors Pix2Pix and CycleGAN. More quantitatively, in terms of MRE for and , Pix2Pix performs and worse than U-Net, respectively. The CycleGAN model performs over worse than U-Net on . We suspect a reason for this might be that the U-Net model is only optimized using the L1 loss. As we can see in Figure 12, its predictions are more continuous than those of Pix2Pix and CycleGAN; hence its residuals are lower than if it were restricted to the 20 possible velocity values present in the actual simulation. Also, we see that Pix2Pix outperforms the CycleGAN architecture on this task. This difference could be due to CycleGAN’s additional objectives: the additional mapping between prediction and geometry and a cycle consistency loss. These objectives are not necessary for the given task of CFD prediction and could be why it performs worse than the other models.

                 Model architecture
Dataset  Metric  Pix2Pix          CycleGAN         U-Net
         MAE     0.0139 ± .0004   0.1651 ± .0151   0.0090 ± .0001
         RMSE    0.0329 ± .0009   0.2642 ± .0105   0.0290 ± .0005
         MRE     0.1261 ± .0002   1.1806 ± .1452   0.0841 ± .0006

         MAE     0.0482 ± .0014   0.0944 ± .0253   0.0345 ± .0003
         RMSE    0.0828 ± .0014   0.1795 ± .0773   0.0701 ± .0007
         MRE     0.2553 ± .0080   0.4785 ± .0831   0.1941 ± .0010

         MAE     0.0554 ± .0009   0.1022 ± .0048   0.0438 ± .0002
         RMSE    0.0971 ± .0013   0.1678 ± .0041   0.0847 ± .0003
         MRE     0.2889 ± .0053   0.5612 ± .0260   0.2400 ± .0017
Table 1: Evaluation metrics for model and baselines on datasets .
Figure 13: Absolute difference between simulation and predictions from proposed method and baselines, sampled from .

To investigate where our models’ predictions deviate most, we illustrate in Figure 13 the absolute difference between the airflow simulation and the predicted flow fields. The mean absolute residuals are , , for Pix2Pix, CycleGAN, and U-Net, respectively. We see that CycleGAN performs much worse than the other two models throughout the entire flow field. Comparing Pix2Pix and U-Net, we see that U-Net has a smaller average residual. Also, Pix2Pix has residuals where the airflow velocity magnitude changes bin value; this is less pronounced for the U-Net, as its prediction is more continuous than the GAN architectures’ predictions.

Model     Time per epoch  # of parameters (M)
Pix2Pix   60 sec          57.1
CycleGAN  236 sec         28.3
U-Net     75 sec          54.4
Table 2: Training time per epoch, on , for all models, including the number of parameters.

All models generate predictions in well under a second at inference time, but training time and model sizes vary. In Table 2 we see that Pix2Pix and U-Net are similar in training time and model size, while CycleGAN, consisting of two generators and two discriminators, has a longer training time of 236 seconds per epoch but is smaller in size. The size difference might also explain why the models perform differently. Both the Pix2Pix and U-Net models rely heavily on the U-Net architecture, with skip connections to keep low-level information between the down- and up-sampling layers. These skip connections are not present in CycleGAN and might be a factor in this model’s reduced performance.

During training, on the validation set, Pix2Pix and U-Net slowly converge to MAE and MRE values close to each other, while CycleGAN’s residuals fluctuate more and are less stable throughout training. For the training loss (see Figure 25), the generator and L1 losses converge rather quickly, while the discriminator losses keep decreasing throughout the whole training period and yield a high discriminator accuracy.

4.2 The impact of Spectral Normalization

As described in subsection 2.4, a persisting challenge in the training of GANs is the performance control of the discriminator [24]. In the initial phase of hyperparameter tuning, we found it hard to find a good ratio between updating the generator and the discriminator. As a result, our discriminator could almost perfectly distinguish the target model distribution early in the training process, which essentially stopped the GAN from learning more. GANs can use spectral normalization to handle problems like this, and in Table 3 we compare the results when training a Pix2Pix model with and without spectral normalization. In the model utilizing spectral normalization, the normalization technique is applied at every layer in the discriminator. From the result table, we see a clear decrease in error across all metrics when evaluated on . On , which is the most complex of them, we see roughly a 16%, 9%, and 11% drop in error for MAE, RMSE, and MRE, respectively.

Dataset  Metric  Pix2Pix          Pix2Pix w/SN     Improvement (%)
         MAE     0.0139 ± .0004   0.0099 ± .0005   28.78 ± 5.5
         RMSE    0.0329 ± .0009   0.0256 ± .0015   22.19 ± 7.6
         MRE     0.1261 ± .0002   0.0915 ± .0040   27.44 ± 3.3

         MAE     0.0482 ± .0014   0.0384 ± .0007   20.33 ± 3.7
         RMSE    0.0828 ± .0014   0.0735 ± .0009   11.23 ± 2.5
         MRE     0.2553 ± .0080   0.2178 ± .0028   14.69 ± 3.7

         MAE     0.0554 ± .0009   0.0464 ± .0004   16.25 ± 2.1
         RMSE    0.0971 ± .0013   0.0882 ± .0004   9.17 ± 1.6
         MRE     0.2889 ± .0053   0.2578 ± .0033   10.76 ± 2.7
Table 3: Evaluation metrics for the study of spectral normalization in Pix2Pix on . The last column shows the improvement in percent when using spectral normalization in the discriminator.
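The improvement column is consistent with a relative error reduction, (baseline − improved) / baseline. The short check below recomputes the first three rows of Table 3 from the reported mean values; the function name is illustrative, and the spread reported alongside each improvement is not reproduced because it would require the per-run results.

```python
def improvement_percent(baseline, improved):
    """Relative error reduction in percent when moving from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline

# Mean errors from the first block of Table 3 (Pix2Pix vs. Pix2Pix w/SN).
mae_gain = improvement_percent(0.0139, 0.0099)
rmse_gain = improvement_percent(0.0329, 0.0256)
mre_gain = improvement_percent(0.1261, 0.0915)
```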
Figure 14: Discriminator accuracy over training epochs where (a) is a Pix2Pix model without spectral normalization in the discriminator (Pix2Pix wo/SN), and (b) is a Pix2Pix model with spectral normalization in the discriminator (Pix2Pix w/SN).

Pix2Pix uses PatchGAN as a discriminator, as explained in subsubsection 2.2.2. The discriminator predicts, for each patch, whether the patch is real or fake. To calculate the accuracy of the PatchGAN, we average the predictions over all patches; when the average is over 0.5, we consider the discriminator to predict the image as real. The two graphs in Figure 14 visualize the discriminator’s accuracy during training. This metric can be a good indication of how strong the discriminator is compared to the generator. As we know, a discriminator that perfectly distinguishes the target model distribution is not learning anything new. Comparing the two graphs, we observe that the Pix2Pix model with spectral normalization holds a lower discriminator accuracy during training than the one without spectral normalization. We believe this gives better feedback to the generator and that this is the reason for the drop in error across all metrics shown in Table 3.
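The accuracy computation described above can be sketched as follows. The snippet assumes that the discriminator outputs one probability in [0, 1] per patch; the function names and the toy patch maps are illustrative only.

```python
import numpy as np

def patch_decision(patch_probs, threshold=0.5):
    """Image-level decision: True (real) if the mean patch score exceeds threshold."""
    return float(np.mean(patch_probs)) > threshold

def discriminator_accuracy(patch_maps, is_real):
    """Fraction of images whose averaged patch decision matches the true label."""
    decisions = [patch_decision(p) for p in patch_maps]
    correct = [d == bool(r) for d, r in zip(decisions, is_real)]
    return sum(correct) / len(correct)

# Three toy 2x2 patch maps: a real image, a fake image, and a fake image
# that fools the discriminator (mean score 0.6).
patch_maps = [
    np.array([[0.9, 0.8], [0.7, 0.6]]),
    np.array([[0.2, 0.1], [0.6, 0.3]]),
    np.array([[0.6, 0.7], [0.5, 0.6]]),
]
accuracy = discriminator_accuracy(patch_maps, is_real=[True, False, False])
```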

4.3 The impact of SDF and CoordConv

To investigate the effect of SDF and CoordConv, both of which provide positional input to the model, we evaluated the different models on with these features implemented both separately and combined. In Table 4 we see that the results are quite close to each other; however, the models with an additional channel containing the normalized SDF of the geometry perform better than the vanilla architecture.

Figure 15: Absolute difference between simulation and predictions from Pix2Pix with variations of positional features in experiment, from

Using CoordConv for the first convolutional layer in both the generator and discriminator also yields a lower MAE than the vanilla model for Pix2Pix and CycleGAN, although not for U-Net (Table 4). When combining SDF and CoordConv, the residuals are generally lower than when either is applied separately. More quantitatively, if we investigate the MAEs, we see that the injected positional information results in a performance improvement. For CycleGAN, we see an MAE reduction of for SDF and for CoordConv; combining them both, we see a gain of . This trend holds for both Pix2Pix and U-Net, and by injecting the two types of positional information, we see an improvement of and , respectively.

We can conclude that these methods positively affect the architectures in predicting CFD airflow velocities and that positional information may be helpful for RANS prediction.

                      Positional information
Model         Metric  None             SDF              CoordConv        SDF & CoordConv
Pix2Pix       MAE     0.0554 ± .0009   0.0522 ± .0011   0.0524 ± .0001   0.0514 ± .0002
              RMSE    0.0971 ± .0013   0.0928 ± .0009   0.0949 ± .0004   0.0918 ± .0006
              MRE     0.2889 ± .0053   0.2719 ± .0048   0.2799 ± .0010   0.2709 ± .0029
Pix2Pix w/SN  MAE     0.0464 ± .0004   0.0449 ± .0003   0.0458 ± .0001   0.0451 ± .0008
              RMSE    0.0882 ± .0004   0.0870 ± .0004   0.0883 ± .0001   0.0868 ± .0010
              MRE     0.2578 ± .0033   0.2512 ± .0032   0.2604 ± .0027   0.2524 ± .0046
CycleGAN      MAE     0.1022 ± .0048   0.0957 ± .0240   0.0856 ± .0108   0.0800 ± .0263
              RMSE    0.1678 ± .0041   0.1560 ± .0289   0.1491 ± .0163   0.1430 ± .0409
              MRE     0.5612 ± .0260   0.5528 ± .1374   0.4822 ± .0563   0.4581 ± .1415
U-Net         MAE     0.0438 ± .0002   0.0427 ± .0002   0.0440 ± .0004   0.0422 ± .0003
              RMSE    0.0847 ± .0003   0.0829 ± .0004   0.0850 ± .0006   0.0821 ± .0004
              MRE     0.2400 ± .0017   0.2322 ± .0024   0.2434 ± .0034   0.2298 ± .0036
Table 4: Average evaluation metrics, on the test set, for the experimental study of the effects of SDF and positional coordinate channels (CoordConv) for predicting CFD airflow. The experiment is evaluated on .

4.4 The influence of attention

In earlier sections, we have shown that both spectral normalization and embedded positional information positively affect our models. Hence, we carry these features into this section, where we experiment with attention. The results of applying self-attention and CBAM are presented in Table 5. We test attention in the generator and the discriminator, both jointly and separately. We also perform one additional experiment for each method: we take the best configuration for each attention mechanism and embed the positional information. We embed both CoordConv and SDF, as this gave the best results in earlier experiments. All experiments are executed using Pix2Pix with spectral normalization on .

From the results in Table 5 we see that adding attention only in the discriminator has a negative effect, both when using self-attention and CBAM. It is hard to say why the attention maps seem to disturb the discriminator in evaluating the predictions. Still, one could argue that all information given to the discriminator is essential, and trying to filter out the less critical parts works against its purpose in this case. In contrast, attention in the generator works better and gives results similar to those obtained earlier when embedding the positional information. We then combined attention in the generator and discriminator. This combination did not positively affect the predictions, which makes sense given how the attention mechanism affected the discriminator. From this, we concluded that adding the attention mechanism to just the generator was the best option.

For the final experiment with attention, we embedded positional information together with the attention mechanisms. The addition of positional information also had a positive effect when combined with attention. If we compare the two attention mechanisms, we see that the best results are obtained with CBAM, which shows a lower error on all metrics and is more stable during training. Given that we embedded the positional information in different input channels, we believe CBAM did best because it infers the attention maps along two separate dimensions, both channel- and spatial-wise.

Comparing these results with those obtained on the same dataset without attention (Table 4), we see a slight drop in error. More specifically, applying the CBAM attention mechanism in addition to spectral normalization, CoordConv, and SDF on the Pix2Pix model yields a 3.4% decrease in MAE, a 2.0% reduction in RMSE, and a 2.1% decrease in MRE compared to the same model without CBAM. We conclude that adding attention could help the model make better predictions for simple geometries such as .

               MAE              RMSE             MRE
               0.0575 ± .0056   0.0998 ± .0084   0.3516 ± .0501
               0.0464 ± .0019   0.0887 ± .0018   0.2568 ± .0103
               0.0587 ± .0083   0.1028 ± .0106   0.3660 ± .0773
, Coord & SDF  0.0446 ± .0017   0.0873 ± .0022   0.2518 ± .0076

               0.0551 ± .0018   0.0984 ± .0022   0.3053 ± .0134
               0.0468 ± .0008   0.0872 ± .0010   0.2614 ± .0029
               0.0527 ± .0013   0.0954 ± .0011   0.2931 ± .0099
, Coord & SDF  0.0436 ± .0006   0.0851 ± .0006   0.2472 ± .0023
Table 5: Evaluation metrics for Pix2Pix with spectral normalization on . We evaluate results using both self-attention and CBAM. We examine the results when applying the different attention mechanisms to the generator (), the discriminator (), and both (). For the method that performs best, we also include an experiment with the positional information given.

4.5 Experiment on buildings with varying height

To evaluate how CFD prediction performs for buildings with varying heights, we have experimented with the different proposed architectures on the dataset. The resulting metrics, describing the prediction-simulation residuals, can be found in Table 6. As we can see, the models, together with the described input construction, are able to generalize the target mapping between geometry and flow fields for buildings with varying heights. If we compare the results in Table 6 and Table 1, we see that the metrics are quite similar to those from the experiments on . All models yield satisfactory residuals, leading us to believe that the models can generalize the target mapping to a more complex dataset with buildings of different heights.

Pix2Pix   0.0450 ± .0005   0.0821 ± .0006   0.1045 ± .0014
CycleGAN  0.0813 ± .0045   0.1329 ± .0067   0.2105 ± .0195
U-Net     0.0390 ± .0001   0.0760 ± .0003   0.0922 ± .0006
Table 6: Evaluation metrics for the experiment on predicting CFD airflow for buildings with varying heights ().

Looking at Figure 16, we see that the models can capture the different building heights. Furthermore, we see that the velocity around the taller buildings is higher, which is what one would expect.

Figure 16: Predictions from models trained on buildings of varying height.

4.6 Generalization between data of varying complexity

We wanted to investigate how well our models generalize to predicting airflow velocities around building geometries of a different complexity than those they were trained on. Samples from the generalization experiment are found in Figure 17. We can see that both the Pix2Pix and U-Net models can somewhat predict the airflow for single buildings, even though they have been trained on two buildings. The results in Table 7 show the mean absolute error and standard deviation from test sets of 20 geometry-simulation pairs. In Table 7 we see, for experiment (b), that U-Net performs marginally better than Pix2Pix on this generalization task, while CycleGAN yields the highest residual.

     Pix2Pix         CycleGAN        U-Net
(a)  0.1093 ± .031   0.1137 ± .032   0.1115 ± .032
(b)  0.0897 ± .026   0.1338 ± .009   0.0861 ± .026
Table 7: Mean absolute error with standard deviation for generalization experiments (a) and (b), and respectively.

For the other task (a), i.e., predicting airflow for two buildings while optimized on single buildings, we see that Pix2Pix has a smaller absolute residual than CycleGAN and U-Net. For this task, CycleGAN performs comparably to the other models, with an absolute residual only marginally higher than that of U-Net in Table 7.

Figure 17: Predictions for models trained on different dataset complexity than evaluated on. denotes a model trained on . (a) Pix2Pix, U-Net and CycleGAN trained on evaluated on , and (b) vice versa.

4.7 Interactive tool for wind flow assessment in urban areas

We perform experiments on the more complex dataset using the Pix2Pix model. We want to determine whether a neural network-based architecture, trained in a data-driven manner, is capable of generating sufficiently accurate wind flow predictions for an interactive tool for city planning. As introduced, was generated by performing simulations on multiple 600 patches of Oslo, Norway, keeping a 300 centered crop. In addition to containing fewer examples, this problem presents a much more realistic scenario and increases the complexity drastically from .

Table 8 displays the results from our experiments on . We observe that the injection of positional information does not impact the results as significantly as before. There is no clear pattern showing that adding the positional information improves the predictions for the more complex scenario; if there is an improvement, it is small. The conditional input contains drastically more building area, which is both more complex and more scattered than before. This suggests that the positional information could be less informative than before.

Additionally, we have done experiments with both attention and spectral normalization. As before, we see an improvement when applying spectral normalization to the discriminator, which is expected to benefit from stable training. When the attention mechanism is applied, we observe a slight increase in error. This decline in performance could indicate that attention does not necessarily facilitate or improve the model when the building geometries get more complicated. When comparing the most accurate Pix2Pix model with U-Net, we observe that Pix2Pix is unable to yield lower residuals than U-Net. While the difference in performance might be insignificant, the training process for U-Net is more straightforward than the training cycle for GANs, with fewer parameters as it does not include a discriminator. This could suggest that the GAN architecture might not necessarily be the best architecture for this problem.

One of the most crucial things when building an interactive tool is that the predictions are fast and accurate enough. A benefit of using neural networks for this is that pre-trained models can be saved, loaded, and served on a server and produce predictions in a matter of seconds.

Figure 18: Comparison of calculated comfort maps from simulations and predictions using Pix2Pix w/SN and CoordConv.

A pedestrian wind comfort map illustrates the annual wind conditions that pedestrians experience at ground level at a site; see Figure 18 for an example. Producing these maps requires combining wind simulations, statistical weather data, and a set of defined comfort criteria [teigen2020influence]. The comfort map shown in Figure 18 is computed by producing wind predictions for eight different wind inlet directions. Each pixel is then categorized using weather data for the specific location, in the form of a wind rose, together with the comfort criteria. As in Lawson's wind comfort criteria [wind-criteria], we have five classes: sitting, standing, strolling, business walking, and uncomfortable. Each class has its own wind speed range and is represented in Figure 18. Because comfort maps need wind velocities for at least eight wind inlet directions, and sometimes up to 36, the tool benefits directly from fast-predicting models.
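The per-pixel classification described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `comfort_map`, the array layout, the assumption that predictions are speeds normalized by the inlet speed, and especially the threshold speeds and the 5% exceedance limit are placeholders. The actual Lawson criteria and site-specific wind rose data would have to be substituted.

```python
import numpy as np

CLASSES = ["sitting", "standing", "strolling", "business walking", "uncomfortable"]
# Placeholder threshold speeds in m/s separating the five classes (assumed values).
THRESHOLDS = np.array([4.0, 6.0, 8.0, 10.0])
# Assumed maximum fraction of time a threshold may be exceeded.
MAX_EXCEEDANCE = 0.05


def comfort_map(speed_ratio, rose_freq, rose_speed):
    """Return an (H, W) array of class indices into CLASSES.

    speed_ratio: (D, H, W) predicted local speed for each of D inlet
                 directions, normalized by the inlet speed (assumption).
    rose_freq:   (D, S) fraction of time with direction d and speed bin s.
    rose_speed:  (S,) representative inlet speed of each bin in m/s.
    """
    # Local speed for every direction, speed bin and pixel: (D, S, H, W).
    local = speed_ratio[:, None, :, :] * rose_speed[None, :, None, None]
    weights = rose_freq[:, :, None, None]
    labels = np.zeros(speed_ratio.shape[1:], dtype=int)
    for threshold in THRESHOLDS:
        # Fraction of time the local speed exceeds this threshold, per pixel.
        exceedance = (weights * (local > threshold)).sum(axis=(0, 1))
        # A pixel that exceeds the threshold too often moves up one class.
        labels += exceedance > MAX_EXCEEDANCE
    return labels
```

Because the output is a coarse class index rather than a velocity, small errors in the predicted speeds only change the map where a pixel sits close to a class boundary.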

Figure 19: Modified wind rose at 10 m above ground level, Oslo

Figure 19 presents the wind rose used in the production of the maps in Figures 18 and 24. Figure 24 displays examples from with predictions and comfort maps. The wind rose is calculated by interpolating historical hourly wind statistics from the last five years from nearby areas, given a flow field's geographical coordinate, but it has been altered to have stronger winds. This is done so that all comfort classes are used, which makes the maps easier to compare. The wind rose is divided into eight wind directions and displays the frequencies of different wind velocities from each direction.

The cloud solution shown in Figure 20 predicts comfort maps through an interactive map. The tool performs eight predictions for the selected area, one for each of the eight wind directions in the wind rose above. To predict the wind flow from different wind directions, we rotate the conditioning input, as the models are trained on simulations where the wind inlet always comes from the left. While the application is at an early stage, it has demonstrated substantial promise as an interactive tool capable of delivering accurate predictions for CFD analysis in an urban city environment.
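The rotation trick described above can be sketched as follows. This is only an illustration under stated assumptions: `rotate_nearest` and `predict_all_directions` are hypothetical helper names, `model` stands for any callable that maps a square footprint image to a scalar wind speed image with the inlet on the left, and the sign convention of the angle would have to be matched to the wind rose. Nearest-neighbor sampling is used here for simplicity; it is not stated which interpolation the tool uses.

```python
import numpy as np


def rotate_nearest(img, angle_deg, fill=0.0):
    """Rotate a square 2D array about its center (nearest-neighbor sampling)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    t = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    y0, x0 = ys - cy, xs - cx
    # Inverse mapping: the source location for every output pixel.
    src_x = np.rint(np.cos(t) * x0 + np.sin(t) * y0 + cx).astype(int)
    src_y = np.rint(-np.sin(t) * x0 + np.cos(t) * y0 + cy).astype(int)
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    # Pixels whose source falls outside the image keep the fill value.
    out = np.full(img.shape, fill, dtype=float)
    out[inside] = img[src_y[inside], src_x[inside]]
    return out


def predict_all_directions(footprint, model, n_directions=8):
    """Run `model` once per wind direction by rotating the conditioning input."""
    fields = []
    for k in range(n_directions):
        angle = 360.0 * k / n_directions
        # Rotate the geometry so that the desired inlet direction is on the left.
        prediction = model(rotate_nearest(footprint, angle))
        # Rotate the prediction back into the original map orientation.
        fields.append(rotate_nearest(prediction, -angle))
    return np.stack(fields)
```

For the diagonal directions the corners of the square rotate out of the frame, so in practice the rotated patch should be larger than the area of interest. This is consistent with keeping only a centered crop of each simulated patch. If the model predicted velocity components rather than speed, the vectors themselves would also need rotating.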

Figure 20: Cloud solution for predicting CFD based comfort maps.
| Model | Metric | None | SDF | CoordConv | SDF & CoordConv |
|---|---|---|---|---|---|
| Pix2Pix | MAE | 0.0732 ± 0.0004 | 0.0749 ± 0.0019 | 0.0732 ± 0.0006 | 0.0746 ± 0.0009 |
| | RMSE | 0.1389 ± 0.0007 | 0.1412 ± 0.0053 | 0.1380 ± 0.0015 | 0.1405 ± 0.0031 |
| | MRE | 0.1280 ± 0.0011 | 0.1316 ± 0.0018 | 0.1291 ± 0.0019 | 0.1317 ± 0.0011 |
| Pix2Pix w/SN | MAE | 0.0692 ± 0.0006 | 0.0707 ± 0.0002 | 0.0691 ± 0.0005 | 0.0708 ± 0.0004 |
| | RMSE | 0.1304 ± 0.0010 | 0.1344 ± 0.0009 | 0.1304 ± 0.0006 | 0.1343 ± 0.0019 |
| | MRE | 0.1224 ± 0.0011 | 0.1260 ± 0.0005 | 0.1228 ± 0.0016 | 0.1266 ± 0.0008 |
| Pix2Pix w/SN & CBAM | MAE | 0.0727 ± 0.0011 | 0.0777 ± 0.0066 | 0.0717 ± 0.0005 | 0.0771 ± 0.0019 |
| | RMSE | 0.1353 ± 0.0011 | 0.1425 ± 0.0118 | 0.1333 ± 0.0005 | 0.1396 ± 0.0042 |
| | MRE | 0.1298 ± 0.0022 | 0.1428 ± 0.0142 | 0.1284 ± 0.0006 | 0.1442 ± 0.0042 |
| CycleGAN | MAE | 0.2746 ± 0.2657 | 0.0962 ± 0.0080 | 0.6507 ± 0.0444 | 0.1101 ± 0.0110 |
| | RMSE | 0.4127 ± 0.3762 | 0.1796 ± 0.0177 | 0.9720 ± 0.0430 | 0.2027 ± 0.0148 |
| | MRE | 0.4146 ± 0.3016 | 0.1724 ± 0.0157 | 0.8415 ± 0.0697 | 0.2069 ± 0.0237 |
| U-Net | MAE | 0.0673 ± 0.0002 | 0.0682 ± 0.0004 | 0.0669 ± 0.0002 | 0.0679 ± 0.0006 |
| | RMSE | 0.1272 ± 0.0001 | 0.1290 ± 0.0008 | 0.1268 ± 0.0001 | 0.1287 ± 0.0007 |
| | MRE | 0.1200 ± 0.0003 | 0.1241 ± 0.0014 | 0.1202 ± 0.0007 | 0.1237 ± 0.0021 |

Table 8: Average evaluation metrics on the test set. The last four columns give the positional information injected into the model. Experiment is evaluated on .
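For reference, the three metrics reported in Table 8 can be computed along the following lines. This is a hedged sketch: the function names are illustrative, and the MRE definition used here (absolute error divided by the magnitude of the simulated value, with a small `eps` to avoid division by zero) is an assumption rather than the paper's stated definition.

```python
import numpy as np


def evaluation_metrics(pred, target, eps=1e-8):
    """MAE, RMSE and (assumed definition of) MRE between two flow fields."""
    diff = pred - target
    abs_err = np.abs(diff)
    return {
        "MAE": abs_err.mean(),
        "RMSE": np.sqrt((diff ** 2).mean()),
        # Assumption: error relative to the magnitude of the simulated value.
        "MRE": (abs_err / (np.abs(target) + eps)).mean(),
    }


def summarize_runs(values):
    """Mean and standard deviation of one metric over repeated training runs."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std()
```

A table entry would then correspond to `summarize_runs` applied to the values of one metric collected from several independently trained models.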

5 Conclusion

We investigated different adversarial networks, Pix2Pix and CycleGAN, along with a U-Net autoencoder to perform image-to-image translation between conditional geometries of buildings to their corresponding wind flows. The presented results show that the models can produce realistic outputs conditioning on the input for all the different datasets. Also, the models made predictions in a significantly shorter time than traditional CFD methods. Furthermore, our experimental study of injecting positional information about the buildings to the model showed that SDF and CoordConv can help the network make accurate predictions. More precisely, by combining both, we got a performance improvement of , and , for Pix2Pix, CycleGAN, and U-Net on . Observing this, we can conclude that the injection of positional information can benefit the airflow prediction task. Our results have also demonstrated a 10% and 4% drop in MRE on and the most complex dataset , respectively, when applying spectral normalization to stabilize training. Moreover, models implementing attention scored better than the ones without it.

We cannot conclude that GANs are better suited to this domain than other kinds of neural networks, as we saw promising results when experimenting with U-Net. While the performances are almost equivalent, the training process for GANs is more complex than that of U-Net because it involves a second network. This suggests that a GAN is not necessarily the best-suited architecture for this problem.

It is hard to say whether or not the model can learn the underlying Reynolds-averaged Navier-Stokes equations. Still, by looking at the absolute difference plots, we can observe that the areas downstream of the buildings have the highest error. Errors in these areas indicate that our models might not be accurate enough to replace final simulations in the most critical areas. However, our experiments on the urban city environment showed that we could use a GAN as the underlying model for an interactive design tool. We consider the results accurate enough, especially when the goal is to produce comfort maps that classify the velocities as in Lawson's wind comfort criteria. Such maps are at least as accurate as the airflow velocity predictions themselves, because the velocities are scaled, averaged, and binned.

5.1 Further Work

CFD lets us solve the governing equations of fluid dynamics for complex engineering problems. CFD is used today in a wide range of industries; some examples are air resistance for airplanes and cars, wind and wave loads on buildings and marine structures, and heat and mass transfer in chemical processing plants. These simulations can provide a detailed understanding of the fluid flow, but they are complex and computationally costly. This cost currently makes processes like generative design and optimization difficult, and interactive design impossible. This work rephrased the problem from computing 3D flow fields using CFD to a 2D image-to-image translation problem. Another approach is to compute the aerodynamic forces on a given geometry. These 3D geometries are often fed into the CFD software via a surface triangulation encoded in .STL or .OBJ file formats. These file formats are supported by many software packages and are widely used for rapid prototyping. An approach like this would require some modifications to the underlying models performing the wind predictions.

Since the simpler U-Net model, trained in a supervised way, scored better on several of the listed metrics, further work should look into other architectures in addition to GANs. An architecture that has grown in popularity over the last couple of years is the graph neural network (GNN). DeepMind showed in some of their latest work how to learn to simulate complex physics with graph networks [sanchezgonzalez2020learning] in various physical domains, including fluid dynamics. Incorporating more of the physical equations into the methods could help optimize the deep learning model, verify the results, and perform uncertainty estimation of the generated output.

Furthermore, solving more complex inputs like whole cities, similar to , probably requires a different approach than conditioning on the entire geometry at once. One opportunity could be to iterate over the prediction area in a more hierarchical way, where the geometries condition on slices earlier in the flow field of the city.


Appendix A Figures and Tables

Figure 21: Samples from test set, with predictions from all models
Figure 22: Samples from test set, with predictions from all models
Figure 23: Sample from test set, with predictions from all models
Figure 24: Predictions from Pix2Pix on , along with comfort maps calculated from simulated and predicted flow fields. Absolute pixel difference maps use the same color scheme as Figure 13.
Figure 25: Training losses for generator (a), discriminator (b), L1-loss (c), for U-Net, on