
Learning to integrate vision data into road network data

by   Oliver Stromann, et al.

Road networks are the core infrastructure for connected and autonomous vehicles, but creating meaningful representations for machine learning applications is a challenging task. In this work, we propose to integrate remote sensing vision data into road network data for improved embeddings with graph neural networks. We present a segmentation of road edges based on spatio-temporal road and traffic characteristics, which allows us to enrich the attribute set of road networks with visual features from satellite imagery and digital surface models. We show that both the segmentation and the integration of vision data can increase performance on a road type classification task, and we achieve state-of-the-art performance on the OSM+DiDi Chuxing dataset on Chengdu, China.





1 Introduction

Knowledge about road networks (RNs) and the traffic flowing through them is key for making good decisions for connected and autonomous vehicles. It is therefore of great importance to find meaningful and efficient representations of RNs that enable and ease knowledge generation.

However, the structural information encoded in spatial network data like RNs has two shortcomings: first, the encoded information is incomplete, and second, information that is difficult to encode yet still relevant is absent. On the other hand, there exist vast amounts of unstructured image data that contain complementary information. Therefore, we propose to enrich the structured but incomplete spatial network data with unstructured but spatially complete data in the form of continuous imagery. We demonstrate the integration of vision data into a Graph Neural Network (GNN) through low-level visual features and propose a segmentation of the road network graph according to spatial and empirical traffic constraints.

Specifically, we address the challenge of enriching crowd-sourced RN data from OpenStreetMap (OSM) [14] with remotely sensed data from Maxar Technologies [18]. We do so by utilizing GPS tracks from DiDi Chuxing’s ride hailing service [10] to spatially segment the RN based on empirical travel times. We evaluate our proposed method on a node classification task for road type labels using GraphSAGE [7].

The results confirm that our approach leads to improved performance on the classification of road type labels in both supervised and unsupervised learning settings. To summarize, our contributions are: 1) the integration of image data, through low-level visual features, into a graph representation of spatial structures by means of spatial segmentation, and 2) a systematic evaluation of our proposed method on node classification in supervised and unsupervised learning settings.

2 Related Work

2.1 Geographical Data on Road Networks

Graph data on RNs is spatial network data, which is nowadays easily accessed through common mapping sources. The RN is represented as an attributed directed spatial graph, with intersections as nodes and roads connecting these intersections as edges. From the crowd-sourced OSM [14], intersection attributes and road attributes can be obtained. One shortcoming of the available crowd-sourced RN data is that some attributes are not consistently recorded. Either data is missing completely for certain geographical regions, or it is inconsistently set [4].

Remote sensing data is data collected from air- or spaceborne sensors like radars, lidars or cameras. It requires extensive data preprocessing like atmospheric, radiometric and topographic error correction. Nowadays, analysis-ready remote sensing data is available which has undergone correction.

2.2 Machine Learning Concepts

2.2.1 Graph Neural Networks

GNNs have experienced a surge in recent years. Many architectures have been proposed to produce deep embeddings of nodes, edges or whole graphs [8, 23]. Techniques from other deep learning domains such as computer vision and natural language processing have been successfully integrated into GNN architectures [23]. In our work we focus on learning node embeddings with GraphSAGE [7], a GNN architecture which relies on an efficient sampling of neighbor nodes and an aggregation of their features.

2.2.2 Visual Feature Extraction

Visual feature extraction is the process of obtaining information from images to derive informative, non-redundant features that facilitate subsequent machine learning tasks. Convolutional Neural Networks (CNNs) have replaced hand-crafted extractors in most applications. CNNs either require data from the application to train on, or need to be pre-trained on enormous datasets and then transferred to the application domain [15]. In our approach, we explicitly evaluate a simple pixel statistic, namely intensity histograms [6].

2.3 Machine Learning on Geographical Data

2.3.1 Machine Learning on Road Networks

Examples of machine learning tasks that are applied to RNs range method-wise from classification and regression to sequence-prediction and clustering [5, 11, 21], and application-wise from vehicle-centric predictions such as next-turn, destination and time of arrival predictions or routing to RN-centric predictions such as speed limit, travel time or traffic flow predictions [21, 13].

Following the growing popularity of OSM, several machine learning methods have been proposed in recent years to either improve or use OSM data [19]. For RNs, conventional machine learning techniques that do not exploit the graph structure have been used to predict road type labels or to impute missing attributes and topologies [4, 12].

GNNs offer the advantage that graph topologies are directly exploited, and no additional features need to be constructed. Consequently, several authors have demonstrated the effectiveness of GNNs on RNs in classifications [11, 21, 9], regressions [11] and sequence predictions [21]. Jepsen et al. [11] propose a relational fusion network (RFN), which uses different representations of a road network concurrently to aggregate features. Wu et al. [21] developed a hierarchical road network representation (HRNR) in which a three-level neural architecture is constructed that encodes functional zones, structural regions and roads respectively. He et al. [9] proposed an integration of visual features from satellite imagery through CNNs as node features to a GNN.

2.3.2 Visual Feature Extraction in Remote Sensing

As in many other domains, transfer learning has also been applied in remote sensing to address common tasks like scene classification [15] or object detection [3]. On the other hand, hand-crafted features combined with classical machine learning algorithms are still commonly used in computer vision on remote sensing data. Object-based land cover classification frequently uses hand-crafted texture or intensity histogram features and still achieves state-of-the-art performance [16, 17, 20].

3 Materials & Methods

3.1 Road Network Graph

Our datasets consist of RN data, GPS tracks and remote sensing imagery. We follow the commonly used approach of a dual graph representation of RNs [11, 5, 21, 13]. That means the graph consists of road segments as nodes and connections between road segments as edges.
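As a minimal sketch of the dual representation described above (not the authors' implementation), the following pure-Python function turns a directed primal graph, given as a list of (start, end) intersection pairs, into its dual: every road becomes a node, and road i is connected to road j when road i ends at the intersection where road j starts. The function name `to_dual` and the toy intersection names are assumptions made for illustration.

```python
from collections import defaultdict


def to_dual(roads):
    """Convert a directed primal graph into its dual (line graph).

    roads: list of (start_intersection, end_intersection) tuples.
    Returns a list of (road_index, road_index) pairs; road i is connected
    to road j when road i ends at the intersection where road j starts.
    """
    # Index the roads by the intersection they start from.
    starts_at = defaultdict(list)
    for idx, (start, _end) in enumerate(roads):
        starts_at[start].append(idx)

    dual_edges = []
    for i, (_start, end) in enumerate(roads):
        for j in starts_at[end]:
            if i != j:  # skip self-loops caused by loop roads
                dual_edges.append((i, j))
    return dual_edges


# Toy network (hypothetical): A -> B -> C and B -> D.
roads = [("A", "B"), ("B", "C"), ("B", "D")]
dual_edges = to_dual(roads)
print(dual_edges)  # [(0, 1), (0, 2)]
```

In this representation, road attributes become node features, which is what allows a node classifier to predict road type labels.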

In this study, RN data is obtained from OSM [14] using OSMnx [2]. We add traffic information to the road attributes by matching GPS tracks of ride hailing vehicles to the RN [22].

3.2 GraphSAGE

GraphSAGE [7] is a graph convolutional neural network which, for an input graph $G$, produces node representations $h_v^{k}$ at layer depth $k$. This is done by aggregating features from the neighboring nodes $\mathcal{N}(v)$ using an aggregation function $\mathrm{AGG}_k$ and, after a linear matrix multiplication with a weight matrix $W^{k}$, applying a non-linear activation function $\sigma$, such that

$$h_v^{k} = \sigma\left(W^{k} \cdot \mathrm{CONCAT}\left(h_v^{k-1},\ \mathrm{AGG}_k\left(\{h_u^{k-1} : u \in \mathcal{N}(v)\}\right)\right)\right).$$

At layer depth $k = 1$ the aggregated features $h_u^{0}$ consist of the neighboring nodes' attributes. The node representations after the last layer (i.e. $k = K$) can be used as node embeddings $z_v$, which serve as input for a downstream machine learning task, like a classifier in our case. The interested reader is referred to Hamilton et al. [7] for implementation details of GraphSAGE.
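To make the aggregation step concrete, here is a minimal NumPy sketch of a single GraphSAGE-style layer with a mean aggregator, following the update rule of Hamilton et al. [7] (concatenate a node's own representation with the aggregated neighbor representations, multiply by a weight matrix, apply a non-linearity). It is an illustration only: neighbor sampling, normalization and training are omitted, and the function name `sage_mean_layer`, the weight shape and the choice of ReLU are assumptions.

```python
import numpy as np


def sage_mean_layer(h, neighbors, weight):
    """One GraphSAGE-style layer with a mean aggregator and ReLU.

    h:         (num_nodes, d_in) array of node representations at depth k-1.
    neighbors: list of neighbor-index lists, one per node.
    weight:    (2 * d_in, d_out) weight matrix (assumed shape).
    """
    out = np.zeros((h.shape[0], weight.shape[1]))
    for v, nbrs in enumerate(neighbors):
        # Mean of the neighbors' representations (zeros if isolated).
        agg = h[nbrs].mean(axis=0) if nbrs else np.zeros(h.shape[1])
        # Concatenate own and aggregated features, then linear map + ReLU.
        out[v] = np.maximum(0.0, np.concatenate([h[v], agg]) @ weight)
    return out


# Toy example: three nodes with 2-dimensional attributes.
h0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = [[1, 2], [0], []]
weight = np.vstack([np.eye(2), np.eye(2)])  # adds own and aggregated features
h1 = sage_mean_layer(h0, neighbors, weight)
print(h1)
```

Stacking several such layers lets each node's representation depend on attributes from nodes that are several hops away.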

Figure 1: Processing steps of our proposed method. a) Graph with intersections in green and roads in yellow. Overlaid is the segmented graph with interstitial vertices in red. b) Rectangular road segment footprints in blue. c) TrueOrtho RGB-imagery. Pixels within each blue rectangle are added to the road segment attributes.

3.3 Contributions

We propose to enrich the spatial network data of RNs with visual features from remote sensing data and present a segmentation of the RN into a fine-grained graph representation. Figure 1 outlines the processing steps of our proposed method.

3.3.1 Segmentation

While the topological representation of an RN is advantageous for routing applications, we argue that a more fine-grained representation of an RN is better suited for many machine learning tasks. In a topological RN, road geometries may vary in length, and road attributes might be non-static for the entirety of a road.

Therefore, we introduce a segmentation which creates a more fine-grained representation of the RN with shorter edges. To take the spatio-temporal nature of traffic on the RN into account, the segmentation aims to create segments of target travel times as well as target segment lengths.

procedure Segment(G, a, t)
    Input: graph G, edge attribute a and target value t
    for each edge e = (u, v) in G do
        Calculate the number of segments n from a(e) and t
        for i = 1, ..., n - 1 do
            p ← interpolate(e, i / n)
            Add node p at the i/n-distance of e between u and v
        end for
    end for
    return G
end procedure
Algorithm 1 Segmentation Algorithm.

Given a target segment travel time and a target segment length, we determine the number of equally distanced split points along a road geometry. At these points, interstitial nodes are inserted, replacing original edges with shorter edges. From the original RN, the segmented RN is thus created by running Algorithm 1 first with the travel time target and secondly with the segment length target. The segmentation is done prior to the conversion of the graph to its dual representation.
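The split-point computation can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `split_points` is hypothetical, the rule for the number of segments (rounding the ratio of attribute value to target, with at least one segment) is an assumption, and the geometry is assumed to be a polyline of non-zero length in a metric coordinate system.

```python
import math


def split_points(coords, value, target):
    """Return equally spaced interstitial points along a polyline.

    coords: list of (x, y) vertices of the road geometry, in meters.
    value:  edge attribute to segment by (e.g. travel time or length).
    target: target value per segment, in the same unit as `value`.
    """
    # Assumed rule: round to the nearest whole number of segments.
    n_segments = max(1, round(value / target))

    # Cumulative arc length at every vertex of the polyline.
    cumulative = [0.0]
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cumulative[-1]

    points = []
    for i in range(1, n_segments):
        distance = total * i / n_segments
        # Find the polyline piece that contains this distance.
        k = 1
        while cumulative[k] < distance:
            k += 1
        # Linear interpolation inside that piece.
        span = cumulative[k] - cumulative[k - 1]
        t = (distance - cumulative[k - 1]) / span
        (x0, y0), (x1, y1) = coords[k - 1], coords[k]
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


# A 300 m road and a 100 m length target: 3 segments, 2 interstitial points.
road = [(0.0, 0.0), (100.0, 0.0), (100.0, 200.0)]
print(split_points(road, value=300.0, target=100.0))
# [(100.0, 0.0), (100.0, 100.0)]
```

Calling the same function with a travel time and a travel time target yields the temporal variant; roads that are already shorter than the target produce no split points.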

3.3.2 Integration of Vision Data

Though the traffic information and segmentation enrich the attribute sets, the structural information in the spatial network still suffers from incompleteness in the encoded information, as not all attributes are consistently set in the crowd-sourced data from OSM. Furthermore, potentially relevant information in the vicinity of the spatial network is not covered in the attribute sets.

Continuous image data on the other hand has the potential to capture such information. Hence, we enrich the road attribute set with visual features from remote sensing data. Analysis-ready high-resolution orthorectified satellite imagery and a Digital Surface Model (DSM) are used in our study. Rectangular image patches around a road footprint are used to extract intensity histograms per channel as visual features. These features have the benefit compared to CNNs that they require no additional training.
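A per-channel intensity histogram can be computed in a few lines of NumPy. The sketch below is illustrative only: the 32 bins match the value reported in Section 4.1, while the function name `channel_histograms`, the 8-bit value range and the normalization of each histogram to sum to one are assumptions.

```python
import numpy as np


def channel_histograms(patch, bins=32, value_range=(0, 256)):
    """Concatenate normalized per-channel intensity histograms.

    patch: (height, width, channels) array of pixel values that lie
           inside `value_range` (assumed 8-bit here).
    Returns a 1-D feature vector of length channels * bins.
    """
    features = []
    for c in range(patch.shape[2]):
        counts, _ = np.histogram(patch[:, :, c], bins=bins, range=value_range)
        # Normalize so that patches of different sizes are comparable.
        features.append(counts / counts.sum())
    return np.concatenate(features)


# Example: a random 240 x 240 pixel RGB patch (120 m at 0.5 m/pixel).
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(240, 240, 3))
features = channel_histograms(patch)
print(features.shape)  # (96,)
```

Because the histogram discards pixel positions, the resulting feature vector does not depend on how the patch is rotated or where objects lie within it, which is one reason such features need no training.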

3.3.3 Summary

We present a method to enrich structural, but incomplete, spatial network data with less-structured, but spatially complete, data in the form of continuous imagery. We rely on simple low-level visual features that require no learning, and demonstrate the effectiveness of this approach in the context of crowd-sourced RN data and high-resolution remote sensing imagery. Our work is closely related to RoadTagger by He et al. [9]. However, RoadTagger relies purely on CNNs to extract visual features and trains them end-to-end with the GNN, whereas we fuse low-level visual features with the road network attributes from OSM. Moreover, we extend the experiments to unsupervised learning and demonstrate that it can achieve comparable performance on the binary road type classification problem. In general, our proposed method offers a light-weight alternative to CNN-based approaches, while achieving comparable performance.

4 Experiments

4.1 Datasets

An RN from the city of Chengdu, China is extracted from OSM [14]. We match GPS trajectories from ride-hailing vehicles to the RN. We use high-resolution (0.5 m/pixel) analysis-ready orthorectified satellite imagery (TrueOrtho) and a Digital Surface Model (DSM) as vision data.

We construct three datasets: the unsegmented, original RN (ORN), the segmented RN (SRN) and the segmented RN with visual features (SRN+Vis). In SRN and SRN+Vis, interstitial nodes are inserted according to the proposed segmentation in Algorithm 1. We set the target travel time to 15 s and the target segment length to 120 m. In ORN, we randomly sample 20% of the nodes for validation and 20% for testing. The remaining nodes are used for training. Validation and test set allocations are propagated down from ORN to SRN and SRN+Vis.
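The split and its propagation to the segmented datasets can be sketched as below. This is an assumed, simplified procedure: the function names `split_roads` and `propagate`, the fixed random seed and the `parent_of` mapping from segment to original road are all hypothetical. The point it illustrates is that all segments cut from one road inherit that road's allocation, so no road contributes segments to both training and test sets.

```python
import random


def split_roads(road_ids, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly assign each original road to train, validation or test."""
    ids = list(road_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    split = {}
    for i, road in enumerate(ids):
        if i < n_val:
            split[road] = "val"
        elif i < n_val + n_test:
            split[road] = "test"
        else:
            split[road] = "train"
    return split


def propagate(split, parent_of):
    """Give every segment the allocation of the road it was cut from."""
    return {segment: split[road] for segment, road in parent_of.items()}


# Ten original roads; road 0 was cut into two segments, road 1 into one.
split = split_roads(range(10))
parent_of = {"s0": 0, "s1": 0, "s2": 1}  # hypothetical segment -> road map
segment_split = propagate(split, parent_of)
print(segment_split)
```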

The following features make up the feature set. Geographical features: length, bearing, centroid and geometry, where geometry is the road geometry resampled to a fixed number of equally distanced points and translated by the centroid to yield relative distances in northing and easting (meters). Binary features: one-way, bridge and tunnel. GPS features from the matched trajectories: travel time, as the median travel time, and throughput, as the average daily vehicle throughput. Additionally, we extract visual features from image patches of 120 m by 120 m from TrueOrtho and DSM, oriented along the bearing. In SRN+Vis, the pixel values of each channel (three RGB channels for the TrueOrtho, one gray-level channel for the DSM) are binned into a histogram of 32 bins.

4.2 Training

To demonstrate how the RN segmentation and the integration of visual features improve the node embedding, we train for node classifications of road type labels.

The highway label from OSM is used as the target. The label describes the type of road as an indicator of road priority. The classes are motorway, trunk, primary, secondary, tertiary, unclassified, residential and living street. Living street and motorway are underrepresented in both ORN and SRN, with less than 2% of all samples. This class imbalance makes the 8-class classification a challenging problem. Additionally, to set this work into the context of RoadTagger [9], we perform a binary classification by aggregating the 8-class predictions of the first four classes (motorway, trunk, primary, secondary) and of the remaining four classes (tertiary, unclassified, residential, living street) into a single class each.
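The aggregation from eight classes to two amounts to a fixed lookup, sketched below. The group names `major` and `minor` and the function name `to_binary` are assumptions introduced for illustration; the label strings follow the OSM `highway` tag values, where living street is spelled `living_street`.

```python
# First four classes and remaining four classes, as listed in the text.
MAJOR = {"motorway", "trunk", "primary", "secondary"}
MINOR = {"tertiary", "unclassified", "residential", "living_street"}


def to_binary(label):
    """Map an 8-class road type prediction to one of two aggregated classes."""
    if label in MAJOR:
        return "major"
    if label in MINOR:
        return "minor"
    raise ValueError(f"unexpected road type label: {label!r}")


predictions = ["motorway", "residential", "secondary", "living_street"]
binary = [to_binary(p) for p in predictions]
print(binary)  # ['major', 'minor', 'major', 'minor']
```

Note that the binary result is derived from the 8-class predictions, so no separate binary model is trained in this scheme.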

We train a GraphSAGE model with a fixed layer depth and mean pooling aggregators using the Adam optimizer. Model selection is based on validation performance, and test performances are reported. We perform a Bayesian hyperparameter search [1]. The search space of hyperparameters is composed of the number of hidden units, embedding dimensionality, learning rate, weight decay and dropout rate. Unsupervised models are trained for 20 epochs with a batch size of 1024. Supervised models are trained for 100 epochs with a batch size of 512.

5 Numerical Results

Tables 1 and 2 show the test results on road type classification in micro-averaged F1-scores for supervised and unsupervised learning respectively. For SRN and SRN+Vis, we report the result of a majority vote that maps the segment predictions back to the roads of ORN. The first two rows depict performance on the 8-class classification problem of eight road type labels. The performance on the binary classification is depicted in the last two rows.
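A minimal sketch of the majority vote and of the metric is given below, under stated assumptions: the function names, the `parent_of` mapping from segment to original road and the tie-breaking rule (first label seen wins) are hypothetical and not taken from the paper. For single-label classification, the micro-averaged F1-score equals plain accuracy, which is what `micro_f1` computes.

```python
from collections import Counter, defaultdict


def majority_vote(segment_predictions, parent_of):
    """Map per-segment labels back to one label per original road."""
    votes = defaultdict(Counter)
    for segment, label in segment_predictions.items():
        votes[parent_of[segment]][label] += 1
    # most_common(1) resolves ties in favor of the label counted first.
    return {road: counts.most_common(1)[0][0] for road, counts in votes.items()}


def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label classification (equals accuracy)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Road r1 was cut into three segments, road r2 into one.
parent_of = {"s1": "r1", "s2": "r1", "s3": "r1", "s4": "r2"}
segment_predictions = {
    "s1": "primary", "s2": "primary", "s3": "secondary", "s4": "tertiary",
}
road_predictions = majority_vote(segment_predictions, parent_of)
print(road_predictions)  # {'r1': 'primary', 'r2': 'tertiary'}

truth = {"r1": "primary", "r2": "residential"}
roads = sorted(truth)
score = micro_f1([truth[r] for r in roads], [road_predictions[r] for r in roads])
print(score)  # 0.5
```

Voting back to the original roads is what makes the SRN and SRN+Vis scores comparable with ORN, since all three are then evaluated on the same set of roads.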

In the supervised results in Table 1, the majority vote of the different SRN subsets shows clearly that SRN+Vis produces the best performing model, with a 13.5% improvement compared to ORN. The segmentation alone (SRN) improves the performance only to a small extent, by 1.4%. When aggregating the predictions to two classes, SRN+Vis again achieves the best performance with a 0.7% improvement over ORN.

In the unsupervised setting in Table 2, the majority voting from SRN is on par with ORN, while SRN+Vis achieves a small improvement of 0.4%. For the binary classification, SRN shows a small performance decrease of 0.7%, while SRN+Vis improves the classification by 0.2%. Curiously, the unsupervised node classification exceeds the performance of the supervised node classification on the binary problem.
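The percentage gains are relative improvements over the ORN baseline. The short sketch below recomputes them from the rounded F1-scores in the tables; the function name `percentage_gain` is an assumption. Recomputing from rounded scores reproduces the reported gains except for the supervised 8-class SRN+Vis entry, which comes out at about 13.4% instead of 13.5%, presumably because the reported value was derived from unrounded scores.

```python
def percentage_gain(score, baseline):
    """Relative improvement over the baseline, in percent, to one decimal."""
    return round(100.0 * (score - baseline) / baseline, 1)


# Supervised, 8-class: ORN 0.580, SRN 0.588, SRN+Vis 0.658.
print(percentage_gain(0.588, 0.580))  # 1.4
print(percentage_gain(0.658, 0.580))  # 13.4 (reported as 13.5%)

# Unsupervised, 2-class: ORN 0.913, SRN 0.907, SRN+Vis 0.915.
print(percentage_gain(0.907, 0.913))  # -0.7
print(percentage_gain(0.915, 0.913))  # 0.2
```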

Overall, the performances on binary classification of 0.911 and 0.915 in supervised and unsupervised respectively, are comparable with the performance of RoadTagger [9], which achieved an accuracy of 93.1% on a similar dataset.




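| | ORN | SRN | SRN+Vis |
|---|---|---|---|
| GNN (8-class) + Majority Vote | 0.580 | 0.588 | 0.658 |
| Percentage Gain, GNN (8-class) | 0.0% | 1.4% | 13.5% |
| GNN (2-class) + Majority Vote | 0.905 | 0.905 | 0.911 |
| Percentage Gain, GNN (2-class) | 0.0% | 0.0% | 0.7% |

Table 1: Results on supervised node classification.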



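| | ORN | SRN | SRN+Vis |
|---|---|---|---|
| GNN (8-class) + Majority Vote | 0.532 | 0.532 | 0.534 |
| Percentage Gain, GNN (8-class) | 0.0% | 0.0% | 0.4% |
| GNN (2-class) + Majority Vote | 0.913 | 0.907 | 0.915 |
| Percentage Gain, GNN (2-class) | 0.0% | -0.7% | 0.2% |

Table 2: Results on unsupervised node classification.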

6 Discussion & Conclusion

We have presented a method to enrich incomplete, structural spatial network data with complete, continuous image data. In the example of road type classification on crowd-sourced RNs, we demonstrated how low-level visual features from remote sensing data can be included into the attribute set of the RN. Moreover, we presented a segmentation based on spatial and empirical traffic information.

The results of our experiments show that low-level visual features, like pixel intensity histograms, can improve performance on both supervised and unsupervised node classification of road type labels. Moreover, we have shown that SRN+Vis can achieve performance similar to state-of-the-art models for binary road type classification in both supervised and unsupervised settings.

Future work should investigate to what extent replacing the low-level visual features with CNNs, as in RoadTagger [9], is beneficial for supervised and unsupervised node classification in RNs. Another interesting direction of research is to analyse how the unsupervised embeddings can be used for multiple different machine learning tasks that are relevant from an RN perspective.


References

  • [1] L. Biewald (2021) Experiment tracking with Weights and Biases. Software, accessed online 2021-08-08.
  • [2] G. Boeing (2017) OSMnx: new methods for acquiring, constructing, analyzing, and visualizing complex street networks. Computers, Environment and Urban Systems 65, pp. 126–139.
  • [3] Z. Chen, T. Zhang, and C. Ouyang (2018) End-to-end airplane detection using transfer learning in remote sensing images. Remote Sensing 10 (1), pp. 139.
  • [4] S. Funke, R. Schirrmeister, and S. Storandt (2015) Automatic extrapolation of missing road network data in OpenStreetMap. In Proceedings of the 2nd International Conference on Mining Urban Data, Volume 1392, pp. 27–35.
  • [5] Z. Gharaee, S. Kowshik, O. Stromann, and M. Felsberg (2021) Graph representation learning for road type classification. Pattern Recognition, pp. 108174.
  • [6] R. C. Gonzalez, R. E. Woods, et al. (2002) Digital image processing. Prentice Hall, Upper Saddle River, NJ.
  • [7] W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Proceedings of the 31st NeurIPS, pp. 1025–1035.
  • [8] W. L. Hamilton (2020) Graph representation learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 14 (3), pp. 1–159.
  • [9] S. He, F. Bastani, S. Jagwani, E. Park, S. Abbar, M. Alizadeh, H. Balakrishnan, S. Chawla, S. Madden, and M. A. Sadeghi (2020) RoadTagger: robust road attribute inference with graph neural networks. In Proceedings of the AAAI, Vol. 34, pp. 10965–10972.
  • [10] T. G. Initiative (2021) Https:// appen-vue/. Accessed online 2021-05-03.
  • [11] T. S. Jepsen, C. S. Jensen, and T. D. Nielsen (2020) Relational fusion networks: graph convolutional networks for road networks. IEEE Transactions on Intelligent Transportation Systems.
  • [12] J. Kaur and J. Singh (2018) An automated approach for quality assessment of OpenStreetMap data. In 2018 International Conference on Computing, Power and Communication Technologies (GUCON), pp. 707–712.
  • [13] J. Liu, G. P. Ong, and X. Chen (2020) GraphSAGE-based traffic speed forecasting for segment network with sparse data. IEEE Transactions on Intelligent Transportation Systems.
  • [14] OpenStreetMap (2021) Planet dump, retrieved online on 2021-06-10.
  • [15] R. Pires de Lima and K. Marfurt (2020) Convolutional neural network for remote-sensing scene classification: transfer learning analysis. Remote Sensing 12 (1), pp. 86.
  • [16] O. Stromann, A. Nascetti, O. Yousif, and Y. Ban (2020) Dimensionality reduction and feature selection for object-based land cover classification based on Sentinel-1 and Sentinel-2 time series using Google Earth Engine. Remote Sensing 12 (1), pp. 76.
  • [17] A. Tassi and M. Vizzari (2020) Object-oriented LULC classification in Google Earth Engine combining SNIC, GLCM, and machine learning algorithms. Remote Sensing 12 (22), pp. 3776.
  • [18] M. Technologies (2021) Https://. Accessed online 2021-08-08.
  • [19] J. E. Vargas-Munoz, S. Srivastava, D. Tuia, and A. X. Falcao (2020) OpenStreetMap: challenges and opportunities in machine learning and remote sensing. IEEE Geoscience and Remote Sensing Magazine 9 (1), pp. 184–199.
  • [20] N. Verde, I. P. Kokkoris, C. Georgiadis, D. Kaimaris, P. Dimopoulos, I. Mitsopoulos, and G. Mallinis (2020) National scale land cover classification for ecosystem services mapping and assessment, using multitemporal Copernicus EO data and Google Earth Engine. Remote Sensing 12 (20), pp. 3303.
  • [21] N. Wu, X. W. Zhao, J. Wang, and D. Pan (2020) Learning effective road network representation with hierarchical graph neural networks. In Proceedings of the 26th ACM SIGKDD, pp. 6–14.
  • [22] C. Yang and G. Gidofalvi (2018) Fast map matching, an algorithm integrating hidden Markov model with precomputation. International Journal of Geographical Information Science 32 (3), pp. 547–570.
  • [23] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2020) Graph neural networks: a review of methods and applications. AI Open 1, pp. 57–81.