## I Introduction

In this work, we present our results from participating in the Traffic4Cast Challenge 2021. The competition aims to predict the development of traffic volume and speed up to one hour into the future. This task is challenging due to the stochastic nature of moving cars and complex spatio-temporal dependencies. This year's Traffic4Cast challenge was subdivided into two challenges focusing on temporal and spatial transfer, respectively. The exact settings are described in detail in Sec. II.

Our approach to this problem is inspired by the well-known U-Net architecture (ronnebergerUNetConvolutionalNetworks2015), which has been used frequently in the scope of the competition. Although U-Net models have been shown to perform well, transfer to unseen cities has been difficult for such convolution-based approaches. pmlr-v123-martin20a provided empirical evidence that graph-based models generalize better to unseen cities, as they allow leveraging prior knowledge about the street network. Thus, instead of relying on visual convolutions (CNN), we apply graph neural networks (GNN) to integrate local traffic information using a road graph. A drawback of GNN-based approaches is the limited control over the receptive field. Expanding the receptive field to larger areas of the graph requires deeper models; however, DeepGNNsPerformanceDecay showed that the performance of GNNs drops beyond a certain depth. We therefore adapted the visual pooling and upsampling operations so that they can account for long-range spatial relations between different areas in the road graph. To enable reproducibility, the code will be made publicly available^{1}.

^{1} https://github.com/LucaHermes/graph-U-Net-traffic-prediction

## II Challenge Details and Data

The traffic data is given as a two-dimensional heat-map-like image representation of size 495 × 436 pixels with eight channels. The channels correspond to directional traffic speed and volume information binned into four discrete directions (north-east, south-east, south-west, and north-west). The data was gathered from a fleet of probe vehicles and sampled at five-minute intervals. The measurements are mapped onto a pixel grid using GPS information to represent traffic movies as shown in Fig. 1. Data was collected over two years (2019 and 2020) in ten different cities around the world. The traffic situation between 2019 and 2020 is subject to a temporal shift caused by the ongoing COVID-19 pandemic, which poses a particular challenge for the transfer of models. In addition to the traffic movies, a road graph was generated from high-resolution images of the respective street network. The nodes in the graph correspond to pixels that belong to a street, and an edge exists between two nodes if the corresponding pixels are adjacent and both belong to a street. Furthermore, static street map images with a resolution similar to the traffic movies are included for each city. These street maps are one-channel images in which the pixel intensities roughly correspond to street size.
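The graph construction rule described above (street pixels become nodes, adjacent street pixels are connected) can be illustrated with a minimal sketch. It is not the challenge's actual preprocessing: the function name `build_road_graph`, the toy mask, and the choice of the 8-neighborhood as the meaning of "adjacent" are assumptions made for illustration.

```python
def build_road_graph(mask):
    """Build a road graph from a binary street mask.

    mask: list of rows; truthy entries mark street pixels.
    Returns (nodes, edges): nodes are (row, col) positions, edges are
    directed (sender, receiver) pairs of node indices.
    Assumption: "adjacent" means the 8-neighborhood of a pixel.
    """
    nodes = [(r, c) for r, row in enumerate(mask)
             for c, is_street in enumerate(row) if is_street]
    index = {pos: i for i, pos in enumerate(nodes)}
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    edges = []
    for (r, c), i in index.items():
        for dr, dc in offsets:
            j = index.get((r + dr, c + dc))
            if j is not None:  # neighbor is also a street pixel
                edges.append((i, j))
    return nodes, edges


# Hypothetical 3x3 street mask: non-street pixels never enter the graph.
toy_mask = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]
nodes, edges = build_road_graph(toy_mask)
```

Only the four street pixels of the toy mask become nodes, so empty areas do not take part in any later computation on the graph.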

Subset | 2019 | 2020 | Cities |
---|---|---|---|
C1 | train | train | Antwerp, Bangkok, Barcelona, Moscow |
C2 | train | - | Berlin, Chicago, Istanbul, Melbourne |
C3 | - | test (core) | Berlin, Chicago, Istanbul, Melbourne |
C4 | test (ext.) | test (ext.) | New York, Vienna |

This year's challenge is divided into a core challenge and an extended challenge. The core challenge focuses on temporal generalization with respect to the domain shift caused by COVID, while the extended challenge focuses on spatial generalization to unseen cities. Participants were invited to compete in both independently. Four different subsets are derived from the data (see Tab. I) and define the two challenges. Specifically, both challenges use the same training dataset (subsets C1 and C2), but different test sets. The evaluation of the core challenge focuses on generalization from the pre-COVID training data of 2019 to the test set C3, recorded during COVID in 2020. The extended challenge focuses on generalization across cities. Model evaluation for this challenge uses the test set C4, which consists of data from two cities that were excluded from the training set. The task in both challenges is to predict the traffic 5, 10, 15, 30, 45, and 60 minutes into the future.
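Given the five-minute sampling interval, the six prediction horizons map to frame offsets after the last input frame. The short sketch below makes this arithmetic explicit; the helper names and the convention that offset 1 is the first frame after the input are assumptions for illustration.

```python
SAMPLING_INTERVAL_MIN = 5                 # one frame every five minutes
HORIZONS_MIN = [5, 10, 15, 30, 45, 60]    # prediction horizons from the task


def horizon_offsets(horizons_min=HORIZONS_MIN,
                    interval_min=SAMPLING_INTERVAL_MIN):
    """Convert horizons in minutes to frame offsets after the last input."""
    return [h // interval_min for h in horizons_min]


def select_targets(future_frames):
    """Pick the evaluated frames from the 12 frames that follow the input.

    Assumption: future_frames[0] is the frame 5 minutes after the input.
    """
    return [future_frames[k - 1] for k in horizon_offsets()]
```

Under these assumptions, only six of the twelve frames in the following hour are evaluated.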

## III Traffic Prediction

Fig. 1: … denotes the global state vector. Panel (b) is inspired by Fig. 4(a) of battagliaRelationalInductiveBiases2018. Our changes to the original full GN block are highlighted as colored elements. The operations are functions that can be learnt. Elements in (a) that contain the GNN layer (b) are denoted by yellow boxes.

Our architecture is based on the classical U-Net model (ronnebergerUNetConvolutionalNetworks2015), which originally relies on a series of two-dimensional visual convolutions. Thus, intermediate feature maps depend not only on the traffic data but also on city-specific empty areas in the traffic movies. As the road graph provides more specific topological information than a regular pixel grid, we generalize this model to graphs by applying GNN layers instead of visual convolutions (CNN). As a consequence, empty areas are excluded from the computations and cannot affect the downstream latent representations, which we assume to be beneficial for cross-city generalization.

In contrast to CNNs, GNNs cannot capture the geographical topology of a node neighborhood because their accumulation functions are permutation invariant. For example, a regular GNN cannot distinguish whether a neighboring node lies to the north or to the south. Invariance to geographical topologies seems like a major drawback, as the traffic features are directional. We propose a straightforward method to mitigate this drawback and specialize a GNN to be sensitive to these geographical neighborhood topologies (see Sec. III-B).

As the edges in the road graph only ever connect nodes that are also adjacent in the image space, the graph diameter (maximum distance between pixels on the graph) is . A single-layer GNN can only explore a 1-hop neighborhood; thus, long-range relations between nodes could only be exploited using a very deep GNN. However, DeepGNNsPerformanceDecay showed that common GNNs do not scale well with model depth and tend to oversmooth in such cases. To still allow information exchange over the whole graph, we include a global state vector in our GNN layers. This vector can be understood as being adjacent to every node in the graph. Furthermore, we leverage the unique position of each node in the 2D pixel grid to design down- and upsampling operations (see Sec. III-C) and thus to expand the receptive field of the GNN.

Following the U-Net schematic depicted in Fig. 1(a), we arrange these operations in a down- and an upsampling branch that are additionally connected via skip connections. The GNNs in the downsampling branch consist of a single layer, whereas the GNNs in the upsampling branch consist of two layers. We first describe the different types of features used in our layers and then the computations inside the layers.

### III-A Feature Generation

The input to our model is composed of the given road graph and the static street map images. The road graph is extended by edge feature vectors and one global feature vector. The node features correspond to the pixel values of the traffic movies, where the individual frames are concatenated into a single vector per node. Speed and volume information is scaled to values between 0 and 1, i.e., divided by 255. To initialize the edge features, a two-layer CNN generates a feature map with eight channels from the normalized static street map. This results in an additional set of node features; concatenating these features for the sender and receiver nodes yields the edge features

(1) | ||||

(2) |

where and are the sender and receiver node of edge , respectively, and denotes concatenation. The global state is computed by summing up the node features and scaling the sum by a constant factor. This factor is needed because the sum over all nodes can be large; the chosen value was found empirically to produce suitable magnitudes. Next, time and weekday information is encoded and concatenated to the global state vector. Time is encoded as a 2D position on the unit circle, where the 24-hour interval corresponds to one full revolution, and the weekday is one-hot encoded.

(3) | ||||

(4) | ||||

(5) |
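The global feature construction described above can be sketched in a few lines. This is an illustration under assumptions rather than the authors' implementation: the function names, the (cos, sin) ordering of the time encoding, the weekday indexing (0 = Monday), and the value passed as `scale` are all hypothetical.

```python
import math


def encode_time(hour, minute=0):
    """Encode the time of day as a point on the unit circle.

    One full revolution corresponds to 24 hours. The (cos, sin) ordering
    is an assumption of this sketch.
    """
    angle = 2.0 * math.pi * (hour + minute / 60.0) / 24.0
    return [math.cos(angle), math.sin(angle)]


def encode_weekday(weekday):
    """One-hot encode the weekday (assumed indexing: 0 = Monday)."""
    return [1.0 if d == weekday else 0.0 for d in range(7)]


def global_state(node_features, scale, hour, weekday):
    """Scaled sum of node features, concatenated with time and weekday.

    `scale` stands in for the constant from the text; its value here is
    hypothetical.
    """
    dim = len(node_features[0])
    summed = [scale * sum(x[k] for x in node_features) for k in range(dim)]
    return summed + encode_time(hour) + encode_weekday(weekday)


# Two toy nodes with two features each, at midnight on a Wednesday.
u = global_state([[1.0, 2.0], [3.0, 4.0]], scale=0.5, hour=0, weekday=2)
```

The circular encoding keeps 23:55 and 00:00 close together in feature space, which a plain scalar hour value would not.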

### III-B Graph Layer

As basis of our graph layer, we use the full GN block proposed by battagliaRelationalInductiveBiases2018, as shown in Fig. 1(b). It is composed of parameterized update functions that are applied to update node, edge, and global graph features, as well as unparameterized functions that accumulate sets. These functions are implemented as

(6) | ||||

(7) |

where denotes vector concatenation over all input vectors , is the weight matrix and is the bias vector. ReLU (Agarap2018DeepLU) is used as the activation function. Fig. 3 visualizes the computations conceptually.

We adapt the full GN block to make the computations sensitive to the local neighborhood topology (adaptations highlighted in Fig. 1(b)). Specifically, we split the road graph into four subgraphs. As shown in Fig. 4, each subgraph contains the edges directed into one of the four quadrants north-west, north-east, south-east, and south-west. We choose these four subgraphs because they reflect the partitioning of the directional traffic speed and volume information given in the node features. Note that each subgraph uses the same node features. For each subgraph, a separate edge transformation computes the updated edge features for each edge in that subgraph. Then, the updated node features are computed by concatenating the edge features of the four subgraphs and, for each node, accumulating the incident edges. Finally, the global state vector is updated using the accumulated node and edge features, as well as the prior global state vector, as inputs. The update functions are

(8) |

where and are the sender and receiver node of edge , respectively, and denotes the neighborhood of node . The outputs of the graph layer are the new state vectors for the nodes, the edges, and the global features.
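The directional split of the road graph can be sketched as follows. The sketch rests on assumptions that the text does not specify: axis-aligned edges are assigned with a half-open quadrant convention, summation is used as the accumulation function, and all names are hypothetical.

```python
DIRECTIONS = ["NE", "SE", "SW", "NW"]


def quadrant(sender_pos, receiver_pos):
    """Quadrant into which the edge sender -> receiver points.

    Positions are (row, col); rows grow southward, columns eastward.
    Assumed half-open convention: east -> NE, north -> NW, west -> SW,
    south -> SE.
    """
    dx = receiver_pos[1] - sender_pos[1]  # positive means eastward
    dy = sender_pos[0] - receiver_pos[0]  # positive means northward
    if dx > 0 and dy >= 0:
        return "NE"
    if dx <= 0 and dy > 0:
        return "NW"
    if dx < 0 and dy <= 0:
        return "SW"
    if dx >= 0 and dy < 0:
        return "SE"
    raise ValueError("self-loop has no direction")


def split_edges(positions, edges):
    """Partition directed edges into four directional subgraphs."""
    subgraphs = {name: [] for name in DIRECTIONS}
    for s, r in edges:
        subgraphs[quadrant(positions[s], positions[r])].append((s, r))
    return subgraphs


def directional_aggregate(num_nodes, subgraphs, edge_features):
    """Accumulate incoming edge features per direction and concatenate.

    edge_features maps (sender, receiver) to a feature list. Each node
    receives four slots, one per direction, so the result depends on where
    a neighbor lies. Summation is an assumed accumulation function.
    """
    dim = len(next(iter(edge_features.values())))
    out = [[0.0] * (4 * dim) for _ in range(num_nodes)]
    for d, name in enumerate(DIRECTIONS):
        for s, r in subgraphs[name]:
            for k, value in enumerate(edge_features[(s, r)]):
                out[r][d * dim + k] += value
    return out
```

Because each direction writes into its own slot of the concatenated vector, the accumulated node input is no longer invariant to the geographical arrangement of the neighbors.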

### III-C Downsampling and Upsampling

While down- and upsampling operations on graphs are not straightforward, in the current setting we can exploit the topological information that locates nodes in the pixel grid. Thus, existing visual pooling and upsampling methods can be adapted to work on the street graph. Our downsampling operation directly corresponds to a regular max-pooling with a kernel size of 2 × 2 and a stride of 2. Specifically, we partition the set of nodes according to their position in the 2D grid such that each partition contains the nodes in a 2 × 2 window, and take the feature-wise maximum. Hence, each partition is condensed into a new node, resulting in a new set of nodes. An edge connects two new nodes if any two nodes in the corresponding pooling windows were connected by an edge. If multiple such connections are present, the feature-wise maximum over all edges that are mapped onto the same new edge yields the new edge features. Finally, the new node position corresponds to the 2D index of the corresponding pooling window.

The upsampling operation relies on a given input graph and a target graph structure. Fig. 6 visualizes our upsampling method. First, zero-initialized nodes are added to the input graph, one for each node of the target graph. Second, the positions of the input nodes are scaled by a factor of two. Then, we partition the nodes by their position in the two-dimensional space such that each partition contains the nodes in a 2 × 2 window, as in the downsampling case. Next, edges are created by connecting each input node to all target nodes that are in the same partition. This creates an upsampling graph as shown in Fig. 6 (middle), effectively connecting the input graph with the target graph. Our adapted GNN is applied to the upsampling graph to propagate information from the input nodes to the target nodes. This results in new node features for the upsampled graph. Note that the GNN uses the same edge partitioning as described in Sec. III-B. Hence, the operation is sensitive to the relative neighborhood topology and will therefore produce different values in the receiving nodes, even if the sending node is the same.
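The downsampling step can be sketched as a max-pooling over grid windows applied to graph data. The function name, the data layout, and the decision to drop edges that fall inside a single pooling window are assumptions of this sketch; the window size of two follows the factor-of-two scaling used in the upsampling step.

```python
def downsample(positions, node_features, edges, edge_features, window=2):
    """Max-pool a grid-embedded graph over non-overlapping windows.

    positions: (row, col) per node; edges: (sender, receiver) index pairs;
    edge_features: one feature list per edge, aligned with `edges`.
    Assumption: edges whose endpoints share a window are dropped.
    """
    coarse_index = {}  # window index -> coarse node id
    assignment = []    # fine node id -> coarse node id
    for r, c in positions:
        key = (r // window, c // window)
        if key not in coarse_index:
            coarse_index[key] = len(coarse_index)
        assignment.append(coarse_index[key])

    # The coarse node position is the 2D index of its pooling window.
    coarse_positions = list(coarse_index)

    # Feature-wise maximum over all nodes in a window.
    coarse_nodes = [None] * len(coarse_positions)
    for node, target in enumerate(assignment):
        feat = node_features[node]
        prev = coarse_nodes[target]
        coarse_nodes[target] = (list(feat) if prev is None
                                else [max(a, b) for a, b in zip(prev, feat)])

    # Feature-wise maximum over all fine edges mapped onto one coarse edge.
    coarse_edges = {}
    for (s, r), feat in zip(edges, edge_features):
        key = (assignment[s], assignment[r])
        if key[0] == key[1]:
            continue
        prev = coarse_edges.get(key)
        coarse_edges[key] = (list(feat) if prev is None
                             else [max(a, b) for a, b in zip(prev, feat)])
    return coarse_positions, coarse_nodes, coarse_edges
```

Applying the function repeatedly halves the grid resolution each time, so a coarse node summarizes an increasingly large area of the road network.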

### III-D Training Setup

We train our model for 800k steps on the provided training data with a standard MSE loss using the Adam optimizer (kingmaAdamMethodStochastic2017). The learning rate schedule contains 2k steps of warm-up. After warm-up, the learning rate is and decays from there exponentially at a rate of every 100 steps. The minimal learning rate is . At each step, we sample valid starting frames for the seed sequences (model input), i.e., frames that correspond to a time between 00:00 and 22:00. We average the gradients of 16 successive samples to update the model parameters, which effectively corresponds to a batch size of 16. We submitted the exact same model to both competitions. Hence, we consider the temporal generalization problem of the core challenge as a kind of spatial generalization problem as well. This is possible because data from 2020 is included in the training set, just for different cities than those used in the evaluation.
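The schedule and the gradient averaging described above can be sketched as follows. The numeric values in the example call are purely hypothetical placeholders, and the linear shape of the warm-up and the staircase form of the decay are assumptions; only the 2k warm-up steps, the 100-step decay interval, and the averaging over 16 samples come from the text.

```python
def learning_rate(step, base_lr, decay_rate, min_lr,
                  warmup_steps=2000, decay_every=100):
    """Warm-up followed by exponential decay with a lower bound.

    Assumptions: the warm-up is linear and the decay is applied as a
    staircase, once every `decay_every` steps.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    num_decays = (step - warmup_steps) // decay_every
    return max(min_lr, base_lr * decay_rate ** num_decays)


def averaged_gradient(gradients):
    """Average per-sample gradients, emulating a larger batch size."""
    n = len(gradients)
    dim = len(gradients[0])
    return [sum(g[k] for g in gradients) / n for k in range(dim)]


# Hypothetical values for illustration only.
lr_after_warmup = learning_rate(2000, base_lr=1e-3, decay_rate=0.99, min_lr=1e-5)
lr_late = learning_rate(800_000, base_lr=1e-3, decay_rate=0.99, min_lr=1e-5)
```

Averaging the gradients of 16 single-sample passes before each update gives the same parameter step as one pass over a batch of 16, while keeping the memory footprint of a single sample.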

## IV Results

We evaluate our model using two different evaluation datasets. As the challenge test set is not openly available, we split the dataset and use a portion of the original training data only for model evaluation. The first evaluation set is a fraction of the training set, obtained by selecting specific points in time. Specifically, for each day in April 2019 (30 days), we sample the seed data at every full hour between 00:00 and 22:00 for each of the eight given cities. Additionally, for exactly the same samples, we flip the data vertically and horizontally. This means that we flip the static images as well as the dynamic traffic movies, and also rearrange the data channels accordingly. This results in a second dataset (denoted as mirrored) that is used to evaluate spatial generalization to new cities. The evaluation metric is the MSE between the prediction and the target image, both scaled by 255 to match the original scale of the given data.
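The mirroring step can be sketched as below. Several points are assumptions rather than facts from the text: both flips are applied jointly (equivalent to a 180° rotation), and the eight channels are laid out as (volume, speed) pairs in the direction order NE, SE, SW, NW. The actual channel layout of the dataset may differ.

```python
# Assumed (hypothetical) channel layout: for each direction in this order,
# one volume channel followed by one speed channel, eight channels in total.
DIRECTIONS = ["NE", "SE", "SW", "NW"]

# A vertical plus horizontal flip is a 180-degree rotation, so every
# heading turns into its opposite.
OPPOSITE = {"NE": "SW", "SE": "NW", "SW": "NE", "NW": "SE"}


def channel_permutation():
    """Source channel index for every channel of the mirrored frame."""
    perm = []
    for name in DIRECTIONS:
        src = DIRECTIONS.index(OPPOSITE[name])
        perm += [2 * src, 2 * src + 1]
    return perm


def mirror_frame(frame):
    """Flip a frame in both spatial axes and rearrange its channels.

    frame[row][col] is a list of eight channel values.
    """
    perm = channel_permutation()
    flipped = [row[::-1] for row in frame[::-1]]
    return [[[pixel[p] for p in perm] for pixel in row] for row in flipped]
```

Without the channel rearrangement, traffic that headed north-east in the original frame would still be stored in the north-east channels after the flip, although it now heads south-west in image space.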

Fig. 7 shows predictions and ground truth images for Berlin. The color values are log-scaled.

Fig. 8 shows the MSE by city (left) and the MSE by time (right) for both evaluation sets. The average performance is also included (left plot; dashed lines). A high variation of performance across cities is observable. Furthermore, our model consistently performs better than average on the cities Melbourne and Barcelona, but consistently worse on the cities Istanbul, Moscow and Berlin.

The right plot in Fig. 8 shows that performance varies with the time of day at which the samples are drawn. Overall, predictions for samples drawn at nighttime, between 01:00 and 04:00, show a smaller error than predictions for samples drawn during daytime. This is probably due to the overall traffic activity, which is lower at nighttime. The difference in performance measured on the two evaluation sets is very small, which indicates that the model is able to generalize to novel spatial situations.

Table II shows an excerpt of the leaderboard of the competition. The naive average corresponds to a model that calculates the average over the input frames and outputs the result for all future frames. Graph-ResNet by pmlr-v123-martin20a was used as a more sophisticated baseline by the competition hosts. Its graph-creation method was changed from using the dynamic data to build the graph to using high-resolution street maps. The Graph-ResNet was trained for a single epoch only on traffic data from Berlin. We ranked seventh in the core challenge and fourth in the extended challenge. In both, our model outperformed the baselines significantly. A comparison of performance between the two challenges is difficult, as different cities are used. However, we can compare the score ratio of our model to the baselines (see relative score in Tab. II). The Graph-ResNet has a relative score of 0.841, whereas our proposed model has a relative score of 0.839, which suggests better spatial generalization.

Model | Competition T | Competition ST | Rel. Score (T / ST) |
---|---|---|---|
Naive Average | 53.406 | 63.14 | 0.846 |
Graph-ResNet BER (pmlr-v123-martin20a) | 51.714 | 61.461 | 0.841 |
Vanilla U-Net (ronnebergerUNetConvolutionalNetworks2015) | 51.283 | - | - |
Hybrid U-Net (ours) | 50.521 | 60.222 | 0.839 |
U-Net + multi-task (T) (u_net_lu) | 48.422 | - | - |
U-Net + multi-task (ST) (u_net_lu) | - | 59.586 | - |
U-Net Ensemble (T) (u_net_sungbin) | 48.494 | - | - |
U-Net Ensemble (ST) (u_net_sungbin) | - | 59.559 | - |
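The naive average baseline is simple enough to state as code. The following sketch assumes that each frame is given as a flat list of values and that six output frames are required; the function name is hypothetical.

```python
def naive_average(input_frames, num_outputs=6):
    """Predict every future frame as the mean of the input frames.

    input_frames: list of frames, each a flat list of pixel values
    (flattening is an assumption of this sketch).
    """
    n = len(input_frames)
    size = len(input_frames[0])
    mean = [sum(frame[k] for frame in input_frames) / n for k in range(size)]
    return [list(mean) for _ in range(num_outputs)]


# Toy example with two input frames of three values each.
prediction = naive_average([[0.0, 2.0, 4.0], [2.0, 4.0, 8.0]])
```

Such a baseline ignores all temporal dynamics, so any model that captures even short-term trends should improve on it.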

### IV-A Ablation Study and Comparison to Vanilla U-Net

To assess the effectiveness of the directional subgraphing, we train a model that is applied to the complete input graph instead of the four subgraphs. This is done simply by removing our modifications from the full GN Block. Thereby, the graph operations become locally permutation invariant, instead of being sensitive to the neighborhood topology. We refer to this simplified version of our model as Graph U-Net.

Table III shows the MSE measured on the evaluation dataset consisting of cities from the training set and on the mirrored evaluation dataset (denoted as MSE*). Our Hybrid U-Net consistently outperforms the Graph U-Net on the evaluation dataset, whereas the results measured on the mirrored data are very similar for the two models. To quantify spatial generalization, we compute the ratio between MSE and MSE* (denoted as rel. MSE). A relative MSE of 1 means that the model produces exactly the same error on the mirrored cities as on the known cities. This would indicate that the performance is not affected by the flipping of the data, which corresponds to perfect spatial generalization. The rel. MSE measured for the Graph U-Net is consistently larger, i.e., closer to 1, than the rel. MSE of our proposed Hybrid U-Net model. This indicates that the Graph U-Net generalizes better to new cities than our Hybrid U-Net model.

Table III also shows the results for a vanilla U-Net. The vanilla U-Net consists of 8 consecutive downsampling blocks and 8 consecutive upsampling blocks. Its MSE measured on the evaluation set is very close to that of our proposed model, but slightly worse on average. In contrast, its MSE* measured on the flipped cities is considerably worse, which is evidence of worse spatial generalization capabilities. This is further supported by the considerably lower relative MSE found for the vanilla U-Net.

City | Hybrid U-Net MSE | Hybrid U-Net MSE* | Hybrid U-Net rel. MSE | Graph U-Net MSE | Graph U-Net MSE* | Graph U-Net rel. MSE | Vanilla U-Net MSE | Vanilla U-Net MSE* | Vanilla U-Net rel. MSE |
---|---|---|---|---|---|---|---|---|---|
ANTWERP | 48.35 | 49.034 | 0.986 | 48.819 | 49.186 | 0.993 | 48.193 | 50.712 | 0.95 |
BANGKOK | 39.466 | 40.338 | 0.978 | 39.729 | 40.045 | 0.992 | 39.444 | 40.908 | 0.964 |
BARCELONA | 28.742 | 29.502 | 0.974 | 28.968 | 29.284 | 0.989 | 28.609 | 29.663 | 0.964 |
BERLIN | 87.047 | 88.41 | 0.985 | 87.798 | 88.388 | 0.993 | 86.95 | 91.068 | 0.955 |
CHICAGO | 32.147 | 32.593 | 0.986 | 32.451 | 32.526 | 0.998 | 32.228 | 32.939 | 0.978 |
ISTANBUL | 61.237 | 62.028 | 0.987 | 61.98 | 62.262 | 0.995 | 61.588 | 64.3 | 0.958 |
MELBOURNE | 25.325 | 25.74 | 0.984 | 25.626 | 25.709 | 0.997 | 25.393 | 26.091 | 0.973 |
MOSCOW | 89.628 | 90.587 | 0.989 | 90.44 | 90.855 | 0.995 | 89.846 | 93.752 | 0.958 |
average | 51.493 | 52.279 | 0.985 | 51.976 | 52.282 | 0.994 | 51.531 | 53.679 | 0.96 |
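As a worked example of the relative MSE, the snippet below recomputes the ratio MSE / MSE* from the average row of Table III. The dictionary and function names are hypothetical; the numbers are copied from the table.

```python
# Average MSE and MSE* per model, taken from the last row of Table III.
TABLE_III_AVERAGE = {
    "Hybrid U-Net": (51.493, 52.279),
    "Graph U-Net": (51.976, 52.282),
    "Vanilla U-Net": (51.531, 53.679),
}


def relative_mse(mse, mse_mirrored):
    """Ratio of the error on known cities to the error on mirrored cities.

    A value of 1 means the mirrored data is predicted as well as the
    original data.
    """
    return mse / mse_mirrored


rel = {model: round(relative_mse(mse, mirrored), 3)
       for model, (mse, mirrored) in TABLE_III_AVERAGE.items()}
```

The recomputed ratios match the reported averages (0.985, 0.994, and 0.96), with the vanilla U-Net showing the largest gap between known and mirrored cities.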

## V Discussion and Conclusion

The problems posed in the traffic forecast challenge have been approached along two different routes: either using a visual model to process whole frames of the traffic movies, or using a GNN to process only pixels that actually depict a road. Intuitively, a graph-based approach leverages prior knowledge about the underlying structure of the street network, which should provide better generalization and transfer. Additionally, areas without a street are excluded from the graph and therefore do not explicitly contribute to the prediction. This is intuitively beneficial, as these areas do not contain traffic information. This principle has already been demonstrated by pmlr-v123-martin20a. As a drawback, however, a purely graph-based approach loses information on directionality, which is crucial for the traffic forecasting challenge because the data is provided in a directional format.

Here, we introduced a U-Net architecture with graph layers that we adapted to be sensitive to the geographical neighborhood topology by splitting the road graph into four direction-dependent subgraphs. Furthermore, we utilize the 2D node position for graph down- and upsampling, which effectively expands the receptive field and allows inference based on a larger portion of the road network. Although we have shown that the approach works in general, we have not yet performed extensive hyperparameter tuning. Investigating the difference in performance between the different cities, refining the proposed up- and downsampling layers, and scaling the complexity of the model are promising directions for future work.