DeepDualMapper: A Gated Fusion Network for Automatic Map Extraction using Aerial Images and Trajectories

02/17/2020 ∙ by Hao Wu, et al. ∙ ByteDance Inc. ∙ FUDAN University

Automatic map extraction is of great importance to urban computing and location-based services. Aerial image and GPS trajectory data refer to two different data sources that could be leveraged to generate the map, although they carry different types of information. Most previous works on data fusion between aerial images and data from auxiliary sensors do not fully utilize the information of both modalities and hence suffer from the issue of information loss. We propose a deep convolutional neural network called DeepDualMapper which fuses the aerial image and trajectory data in a more seamless manner to extract the digital map. We design a gated fusion module to explicitly control the information flows from both modalities in a complementary-aware manner. Moreover, we propose a novel densely supervised refinement decoder to generate the prediction in a coarse-to-fine way. Our comprehensive experiments demonstrate that DeepDualMapper can fuse the information of images and trajectories much more effectively than existing approaches, and is able to generate maps with higher accuracy.








Generating city maps is a fundamental building block of many location-based applications such as navigation and autonomous vehicles. Multiple data sources can be leveraged to generate the map automatically. In this paper, we focus on two of them, namely aerial images and GPS trajectories. The former are captured by satellites. Such images are publicly available in many places (e.g., Google Earth), can stay up to date, and have been explored extensively in the past decades for map extraction (roadtracer; dlinknet; stackedunet; robocodes; hinton_mapinference; casnet). The latter capture human movements in the urban area, most of which are constrained by the underlying road network. Thanks to the fast development of mobile platforms and localization techniques, collecting GPS trajectories is much easier nowadays, and how to extract the map from trajectory data is also an active research problem (kde; tc1; trajsift; cobweb).

Figure 1: Examples showing some limitations of aerial images and trajectories (visualized by plotting the GPS points). E.g., roads are covered by trees and trajectories are sparse on minor roads. However, by leveraging both data sources, such problems are expected to be eliminated.

However, neither data source is perfect for the task of automatic map extraction because of the information loss issue. There is inevitably missing information in the data, which makes it extremely challenging, if not impossible, to infer the map accurately. E.g., as shown in Fig. 1, in aerial images, some roads may be occluded by trees and buildings; in trajectory data, some less popular roads may have very few, or zero, historical trajectories passing through them. Such information loss introduces more complexity to the map extraction process.

As aerial images and trajectory data capture different types of information, the information missed in one source might be available in the other. In other words, combining two complementary data sources offers an effective way to make full use of the information that is well captured by at least one of them. In the current literature, a variety of studies have tried to fuse aerial images with auxiliary data sources, including earth observation data such as Radar [12] and Lidar (beyondrgb; l3fsn; triseg), OpenStreetMap data [3] and street view images [9], to solve the urban scene semantic segmentation problem, which is, to some extent, related to the map extraction task. However, these approaches either simply concatenate the features of the two modalities (i.e., early fusion) or average the predictions made by each modality (i.e., late fusion). We argue that such coarse designs fail to fully exploit both modalities and cannot effectively address the information loss faced by each of them. Among these approaches, the most relevant work solves the same problem by proposing a new deconvolution strategy called the 1D decoder [28]. However, that work also simply concatenates the image features and the trajectory heat map, which leaves room for further improvement.

Consequently, we propose a novel fusion model, namely DeepDualMapper, which aims to fuse the aerial image and trajectory data more seamlessly. We design a Gated Fusion Module (GFM) to explicitly learn the modality selection based on the confidence of each data source. It controls the information flows of aerial images and trajectories by explicitly defining a learnable weight to make the fusion process complementary-aware, i.e., giving a higher weight to the data source that is more trustworthy and more valuable to infer the answer. Such a design is consistent with the human decision-making process when we need to choose one between two data sources with different confidences. We also introduce a refinement process, conducted through the Densely Supervised Refinement (DSR) strategy, to refine the prediction generated by GFM from coarse to fine via residual refinement learning.

In summary, we have made three major contributions in this paper. First, we propose a novel data fusion model called DeepDualMapper which utilizes both aerial images and trajectory data more effectively for the task of map extraction. Second, we design a novel gated fusion module and a refinement decoder that can adaptively control the information flow of both modalities and select the one being more reliable in a coarse-to-fine refinement manner. Such a design follows the heuristics of the human decision-making process when judging two data sources. Third, our model not only outperforms the baselines in three real datasets but also demonstrates a superior resilience to information loss as it can generate the map with higher accuracy than all the existing competitors even when both modalities have 25% information loss.

Related work

Map Extraction via Aerial Images.

Many approaches have been proposed to extract a digital map from aerial images. [19] uses a feed-forward neural network with unsupervised pre-training to detect roads. After CNNs demonstrated their predictive power on visual tasks, many approaches adopted CNNs for this task, such as [1] (with FCN), [10] (with DeconvNet), and [11] (with SegNet). [18] and [6] are two representative recent approaches that extract the road network topology instead of directly extracting a binary representation of the map. [29] and [27] adopt U-Net-like structures to generate the maps. These approaches rely purely on aerial images, which have their limitations (e.g., roads covered by trees and shadows are not visible) and require high-quality optical sensors.

Map Extraction via Trajectory Data.

Approaches leveraging trajectory data mainly rely on clustering techniques: they cluster the trajectories or GPS points located on the same road and then extract a road from each cluster. [24] performs k-means-style clustering on GPS samples to generate roads. [8] clusters the trajectories into a growing road network. [7] is a representative method that transforms trajectories into a discretized image and adopts kernel density estimation to extract the road network. Among these approaches, [17] is reported as the one achieving the best performance [16]. Recently, map update, which incrementally discovers new roads and adds them to the existing road network, has also attracted attention (crowdatlas; cobweb; hymu; glue; trajsift). In general, these trajectory-based approaches solve the problem heuristically rather than through learning, which makes them inevitably inferior to learning-based approaches (as will be shown in the experiments).

Data Fusion for Aerial Images.

In the literature, some research works use aerial images with the assistance of other remote sensing data sources to reinforce the result. Semantic labeling of urban areas is a representative application that fuses aerial images with other sensor data such as Lidar point clouds (beyondrgb; l3fsn), laser data [2], OpenStreetMap data [3] and street view images [9]. The late fusion strategy is adopted by (rgbosmfuse; segnetrc; beyondrgb; l3fsn). [2] improves it with a residual correction strategy called SegNet-RC. The encoder fusion strategy is applied in (rgbosmfuse; beyondrgb; anotherfusenet; l3fsn) via FuseNet [13]. V-FuseNet, an advanced version, is proposed in [4]. Besides, [12] uses multiple sensors, including satellite images, to segment flooded buildings; it fuses the sensors through the late fusion strategy. [21] and [28] are the only existing works that fuse aerial images with another auxiliary data source to detect roads. The former fuses the RGB image with Lidar data, while the latter fuses the image with trajectory data to generate the map. However, most of these works either simply concatenate the features or average the predictions of both modalities to perform data fusion, and hence do not fully utilize both data sources.

Figure 2: The main fusion architecture of our model. The right part illustrates the details of the gated fusion module. We omit the details of the encoder as it is the same as U-Net. Note that the dense supervisions are omitted from the figure for clarity; please refer to Fig. 5 for the details. All annotated dimensions are assumed by feeding the input in the size of , for a clearer understanding of each layer. Best viewed in color.

Deep Dual Mapper

Model Overview

We follow image-based map extraction approaches and cast the map extraction task as a pixel-wise binary prediction task (dlinknet; stackedunet; casnet; 1ddecoder). We select U-Net as the main framework for its excellent performance in semantic segmentation, especially biomedical image segmentation, with a simple and elegant structure [23]. For the input data, we crop the whole area into grid tiles, and the trajectory feature is represented by counting the historical GPS points in each pixel. Please refer to the supplementary material for more details on the model and the data preprocessing procedure.
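The trajectory feature described above (a per-pixel count of historical GPS points) can be sketched as follows. This is a minimal illustration, not the preprocessing used in the paper: the function name `rasterize_gps_points`, the tile bounds, and the tile size are all assumptions made for the example.

```python
def rasterize_gps_points(points, lat_range, lon_range, height, width):
    """Count GPS points per pixel of a height x width tile.

    points: iterable of (lat, lon) pairs.
    lat_range, lon_range: (min, max) bounds of the tile; points outside are skipped.
    """
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    counts = [[0] * width for _ in range(height)]
    for lat, lon in points:
        if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
            continue
        # Row 0 is the northern edge, matching image coordinates.
        row = int((lat_max - lat) / (lat_max - lat_min) * height)
        col = int((lon - lon_min) / (lon_max - lon_min) * width)
        # Points exactly on the southern or eastern edge fall into the last pixel.
        counts[min(row, height - 1)][min(col, width - 1)] += 1
    return counts


# Hypothetical 4 x 4 tile covering a 4-degree square, for illustration only.
grid = rasterize_gps_points(
    [(3.5, 0.5), (3.5, 0.5), (0.5, 3.5), (5.0, 1.0)],
    lat_range=(0.0, 4.0), lon_range=(0.0, 4.0), height=4, width=4,
)
```

Here the two identical points land in the same top-left pixel, and the point outside the tile is ignored. Any normalization of the counts before feeding them to the network is left out of this sketch.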

Fig. 2 illustrates the main architecture of DeepDualMapper. It consists of two independent U-Nets, each taking one data source as input. The original decoders of the two U-Nets serve as auxiliary decoders that preserve the key information of the aerial images and of the trajectories, respectively. We extract the activation maps of the two branches at all scales, i.e., $\{I_i\}$ for the aerial image and $\{T_i\}$ for the trajectory, and fuse them through the gated fusion module, which produces the fused features $\{F_i\}$. We then propose the main decoder, the densely supervised refinement decoder, which shares the same structure as U-Net's decoder. It takes the fused features as input, produces a feature map with the same spatial dimensions as the original input, and applies a linear binary predictor to generate the pixel-wise predictions. In the following sections, we detail the gated fusion module and the densely supervised refinement decoder.

Gated Fusion Module

One of the key novelties of our model is the Gated Fusion Module (GFM), which is proposed to simulate the human decision-making process. Given an analytic task and two data sources, we normally select the data source that is more informative and provides more valuable input for the given task. That is to say, for the task of map extraction, we prefer the data source that makes the inference of the roads easier and provides more useful information than the other one. More specifically, the GFM takes the activation maps from the two modalities as inputs, i.e., $I_i$ for the aerial image and $T_i$ for the trajectory, where $i$ refers to the level ($1 \le i \le 5$), and outputs the fused features $F_i$. GFM is composed of two sub-modules, i.e., the adapter and the selector, as detailed in the following.

As the two activation maps are extracted from two independent branches, there is a potential inconsistency between the feature spaces of the two modalities, so directly fusing them is not ideal. Accordingly, adapters are introduced to transfer the activation spaces $\mathcal{S}_I$ and $\mathcal{S}_T$ to a uniform space $\mathcal{S}_F$, which enables a linear combination of the two activation maps in that space. We denote the adapted features as $I'_i = A_I(I_i)$ and $T'_i = A_T(T_i)$. Here, $A_I$ and $A_T$ are channel dimension-preserving adapter operations implemented by convolution, i.e., the channel dimension is linearly transferred into the uniform space, $\mathbb{R}^{H \times W \times C} \to \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ refer to the height, width and channel dimension of the feature map respectively. After the transformation, the activation maps lie in a uniform space and can be safely fused by a linear combination, i.e.,

$$F_i = G^I_i \odot I'_i + G^T_i \odot T'_i, \quad \text{s.t. } G^I_i + G^T_i = 1,$$

where $G^I_i$ and $G^T_i$ refer to the gate values w.r.t. the aerial image and the trajectory respectively, and $\odot$ denotes the element-wise product. The complementary constraint forces the module to learn a complementary-style fusion. If a certain area in one modality has higher confidence in generating the prediction, it is assigned a larger weight, which simultaneously reduces the weight of the other modality with lower confidence. If both modalities have sufficient useful information to make a good prediction, the exact values of their weights are not important.

The selector sub-module computes the gate values $G^I_i$ and $G^T_i$ from the level-$i$ information of the two modalities. To obtain them, we first compute the un-normalized gates $\hat{G}_i$, whose two channel slices are $\hat{G}^I_i$ and $\hat{G}^T_i$. A softmax function normalizes the gate values at each location of $\hat{G}_i$ to meet the complementary constraint, as shown in Eq. (2). The un-normalized gates are computed recursively, as formulated in Eq. (1): the decision of the gate at level $i$ is based on the decision made at level $i-1$ ($\hat{G}_{i-1}$) plus a residual refinement $R^G_i$. The intuition is that we want to impose a consistency regularization on the gate decision space across all scales, i.e., the gates generated at different resolutions should not differ much, as in the example shown in Fig. 3. Note that $\hat{G}_{i-1}$ has a different scale from $\hat{G}_i$, which is resolved by a non-parametric up-sampling operation $U(\cdot)$. The residual refinement $R^G_i$ is computed from the concatenation of the two modalities' level-$i$ feature maps ($[\cdot, \cdot]$ denotes the concatenation of two feature maps along the channel dimension), followed by two convolution blocks with batch normalization and ReLU activation. We stack these convolutions to enlarge the receptive field and to collect more global information from the two modalities. The residual is then produced by a pixel-wise linear transformation, implemented by a $1 \times 1$ convolution with 2 output channels. The whole computation flow of the GFM is illustrated in the right part of Fig. 2.

$$\hat{G}_i = U(\hat{G}_{i-1}) + R^G_i \tag{1}$$

$$G^I_i = \frac{\exp(\hat{G}^I_i)}{\exp(\hat{G}^I_i) + \exp(\hat{G}^T_i)}, \qquad G^T_i = \frac{\exp(\hat{G}^T_i)}{\exp(\hat{G}^I_i) + \exp(\hat{G}^T_i)} \tag{2}$$
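The gate computation described above (up-sample the previous level's un-normalized gates, add a residual, normalize the two gate channels with a softmax, and combine the adapted features with the resulting complementary weights) can be sketched without a tensor library. This is a minimal sketch under stated assumptions: the residual is passed in as a plain array rather than produced by convolution blocks, the up-sampling is assumed to be nearest-neighbour with factor 2, and all function names are hypothetical.

```python
import math


def upsample2x(gate):
    """Nearest-neighbour 2x up-sampling of an H x W x 2 map of gate logits."""
    out = []
    for row in gate:
        wide = [list(px) for px in row for _ in range(2)]
        out.append(wide)
        out.append([list(px) for px in wide])
    return out


def refine_gate(prev_logits, residual):
    """Un-normalized gates of one level: up-sampled previous level plus a residual."""
    up = upsample2x(prev_logits)
    return [[[u + r for u, r in zip(up_px, res_px)]
             for up_px, res_px in zip(up_row, res_row)]
            for up_row, res_row in zip(up, residual)]


def softmax2(logit_img, logit_traj):
    """Two-way softmax, so the image and trajectory gates always sum to 1."""
    m = max(logit_img, logit_traj)
    e_img, e_traj = math.exp(logit_img - m), math.exp(logit_traj - m)
    return e_img / (e_img + e_traj), e_traj / (e_img + e_traj)


def fuse(img_feat, traj_feat, logits):
    """Per-pixel convex combination of two adapted H x W x C feature maps."""
    fused = []
    for img_row, traj_row, gate_row in zip(img_feat, traj_feat, logits):
        fused_row = []
        for f_img, f_traj, (l_img, l_traj) in zip(img_row, traj_row, gate_row):
            g_img, g_traj = softmax2(l_img, l_traj)
            fused_row.append([g_img * a + g_traj * b for a, b in zip(f_img, f_traj)])
        fused.append(fused_row)
    return fused


coarse = [[[0.0, 0.0]]]                    # 1 x 1 gate logits (image, trajectory)
residual = [[[5.0, 0.0], [0.0, 5.0]],
            [[0.0, 0.0], [0.0, 0.0]]]      # 2 x 2 residual, hypothetical values
logits = refine_gate(coarse, residual)
img = [[[1.0], [1.0]], [[1.0], [1.0]]]     # adapted image features, C = 1
traj = [[[0.0], [0.0]], [[0.0], [0.0]]]    # adapted trajectory features, C = 1
fused = fuse(img, traj, logits)
```

In this toy input the top-left pixel trusts the image almost entirely, the top-right pixel trusts the trajectory, and the bottom row weights both modalities equally because the residual leaves the coarse decision unchanged.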

Figure 3: An example showing the coarse-to-fine gate refinement procedure.

Densely Supervised Refinement Decoding

The GFM aims to simulate the human decision-making process for judging the information from two modalities. In this section, we propose a densely supervised refinement decoder (DSRD) to further process the fused information into the prediction map. We introduce it through the following two operations, i.e., the fusion refinement and the dense supervision.

Fusion Refinement.

Recall that GFM outputs the fused features $F_i$, which contain the more useful information from the two data sources and are expected to be more confident in making the prediction. Inspired by residual refinement learning [14], we directly use $F_i$ as the base of the feature map and let the decoder learn a residual refinement of it, which can be formalized as

$$F'_i = F_i + R^F(F'_{i-1}, F_i).$$

$F'_i$ denotes the refined features of level $i$, based on $F_i$. Here, $R^F$ is the residual refinement function. Fig. 4 shows an example of the generation of a prediction. First, the image and trajectory modalities make their own independent predictions (with the information contained in $I'_i$ and $T'_i$ respectively), but these predictions might not be precise enough. Next, the GFM fuses the two data sources by amplifying the one that is more confident in predicting the answer, which leads to a more precise prediction. However, we can observe from Fig. 4 that such a linear combination may still face some issues, e.g., isolated tiny roads and unsmoothed predictions. Then, the decoder uses the fused feature as the base and learns some area-invariant refinements, such as smoothing the prediction and removing isolated points and short branches. Such a fusion-refinement process is consistent with our intuition.

The refinement decoder still follows the elegant structure of U-Net, i.e., 4 decoding blocks, each composed of a stride-2 deconvolutional layer $\mathrm{Deconv}(\cdot)$ for learnable up-sampling and two convolutions (each with batch normalization and followed by a ReLU activation), denoted as $\mathrm{Conv}(\cdot)$. Concatenation along the channel dimension, written $[\cdot, \cdot]$, is adopted to combine the information up-sampled from the previous level ($F'_{i-1}$) and the information from the current level ($F_i$). The difference is that U-Net concatenates the activation maps from the encoder through skip connections, while the newly proposed DSRD concatenates the fused features produced by GFM. The following equation formalizes the residual refinement function, and Fig. 5 shows the computation flow of the decoder at the $i$-th level.

$$R^F(F'_{i-1}, F_i) = \mathrm{Conv}\big(\big[\mathrm{Deconv}(F'_{i-1}),\, F_i\big]\big)$$

Figure 4: An example showing the process of refinement.

Dense Supervision.

We want to highlight that, although we design a refinement structure which seems to be reasonable, we could not rely on minimizing the final prediction loss as the only objective. If we do so, the refinement structure might not be able to achieve the performance it is designed to achieve, as there is no guidance to ensure that the decoder has indeed learned how to fuse both modalities and how to refine the prediction. Thus, to enforce our refinement procedure to work as expected, we propose a shared prediction module, which can incorporate dense supervisions to explicitly regularize the learned features and to guide the learning procedure.

Recall that we have introduced two adapters $A_I$ and $A_T$ in GFM to transfer the feature maps $I_i$ and $T_i$ from spaces $\mathcal{S}_I$ and $\mathcal{S}_T$ to a uniform space $\mathcal{S}_F$. Consequently, the fused feature $F_i$, which is the linear combination of the adapted features $I'_i$ and $T'_i$, will also lie in $\mathcal{S}_F$. Besides, the refined feature $F'_i$, which is based on $F_i$ (with a learned residual), should also be located in space $\mathcal{S}_F$.

Accordingly, we adopt a shared prediction module that takes features in the uniform space $\mathcal{S}_F$ as input and produces a binary prediction. We then simultaneously predict the labels from the adapted image feature $I'_i$, the adapted trajectory feature $T'_i$, the fused feature $F_i$ and the refined feature $F'_i$, so that each of them is supervised. The prediction module takes a tensor of shape $H \times W \times C$ as input and generates a binary prediction with the same spatial size $H \times W$. It is implemented by a $1 \times 1$ convolutional layer that performs the linear transformation and a 2-class softmax layer that predicts the probability.
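A shared prediction module of this kind (a per-pixel linear map over the channels followed by a 2-class softmax, reused for every feature map) can be sketched as below. The weights and feature values are hypothetical, class 1 is assumed to mean "road", and the function name `predict_road_prob` is introduced only for this illustration.

```python
import math


def predict_road_prob(features, weights, bias):
    """Apply one shared per-pixel linear map and a 2-class softmax.

    features: H x W x C nested lists; weights: 2 x C; bias: length 2.
    Returns an H x W map with the probability of class 1 (assumed to be "road").
    """
    probs = []
    for row in features:
        out_row = []
        for px in row:
            logits = [sum(w * x for w, x in zip(w_k, px)) + b_k
                      for w_k, b_k in zip(weights, bias)]
            m = max(logits)
            exps = [math.exp(z - m) for z in logits]
            out_row.append(exps[1] / sum(exps))
        probs.append(out_row)
    return probs


weights = [[0.0, 2.0], [2.0, 0.0]]           # hypothetical 2 x C weights, C = 2
bias = [0.0, 0.0]
adapted_image = [[[1.0, 0.0], [0.0, 1.0]]]   # 1 x 2 x 2 feature map
fused_feature = [[[0.5, 0.5], [0.0, 1.0]]]   # another map in the same space
p_image = predict_road_prob(adapted_image, weights, bias)
p_fused = predict_road_prob(fused_feature, weights, bias)
```

Because the same `weights` and `bias` score both feature maps, the predictor only yields sensible outputs if the maps live in a common feature space, which is the property the dense supervision is meant to enforce.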

The dense supervision has two functions. First, it explicitly constrains space consistency. Recall that in the previous sections we only assumed, without any constraint, that the features produced by the adapters lie in the same space. Once the shared prediction module, which accepts only one feature space, is adopted, the space learned after the adaptation and the space of the fused and refined features $F_i$ and $F'_i$ are forced into the uniform space $\mathcal{S}_F$. Second, it optimizes the predictions generated from the fused feature $F_i$, which is equivalent to optimizing the gate values $G^I_i$ and $G^T_i$. This supervision allows the gradient to be propagated through a shortcut to the GFM, which facilitates the learning of the fusion process.

Figure 5: Structure of the densely supervised refinement


We adopt pixel-wise cross entropy loss on all the predictions in all levels, i.e., predictions computed from , , and , where . We train the model via Adam optimization [15] with the learning rate at 1e-4. As our model involves many supervisions, the weights of supervisions , , are all set to and that of is set to (in all levels). Note that the performance is not very sensitive to the weights. At inference time, is used as the model output. Please refer to the supplementary material for more training details.
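The training objective described above, a weighted sum of pixel-wise cross-entropy terms over every supervised prediction at every level, can be sketched as follows. The supervision weights shown are placeholders chosen for illustration and are not the values used in the paper; the sketch also assumes that a label map of matching resolution is available at each level.

```python
import math


def pixel_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross entropy between H x W road probabilities and 0/1 labels."""
    total, count = 0.0, 0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count


def dense_supervision_loss(level_preds, level_labels, weights):
    """Weighted sum of cross-entropy terms over all levels and all predictions.

    level_preds: one dict per level mapping a prediction name to H x W probabilities.
    level_labels: one H x W label map per level (assumed to match that level's size).
    weights: dict mapping each prediction name to its supervision weight.
    """
    loss = 0.0
    for preds, labels in zip(level_preds, level_labels):
        for name, probs in preds.items():
            loss += weights[name] * pixel_cross_entropy(probs, labels)
    return loss


# Placeholder weights, NOT the paper's values.
weights = {"image": 0.5, "trajectory": 0.5, "fused": 0.5, "refined": 1.0}
labels = [[1, 0]]
uninformative = [[0.5, 0.5]]
preds = {name: uninformative for name in weights}
loss = dense_supervision_loss([preds], [labels], weights)
```

With uninformative predictions of 0.5 everywhere, each term equals ln 2, so the total is the sum of the weights times ln 2.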


Experiment Setting

Dataset Porto Shanghai Singapore
Metric IoU F1-score IoU F1-score IoU F1-score
Trajectory-based Approaches
Aerial Image-based Approaches
Fusion-based Approaches
Early Fusion
Late Fusion
SegNet-RC (late fusion + correction)
FuseNet (encoder fusion)
V-FuseNet (encoder fusion)
1D decoder
Table 1: The performance of all approaches under three image-trajectory-paired datasets. TC1, KDE and COBWEB are not learning-based approaches thus their results are deterministic.

We use three real city-scale datasets containing aerial images and GPS trajectories for evaluation (Porto, Shanghai, and Singapore). Following (1ddecoder; stackedunet; dlinknet; triseg), we adopt the intersection over union (IoU) and the F1-score of road pixels as the evaluation metrics. We denote correctly predicted road pixels (true positives) as TP, correctly predicted non-road pixels (true negatives) as TN, road pixels wrongly predicted as non-road (false negatives) as FN, and non-road pixels wrongly predicted as road (false positives) as FP. The IoU is computed as $\mathrm{IoU} = \frac{TP}{TP + FP + FN}$. Precision is computed as $\mathrm{Precision} = \frac{TP}{TP + FP}$ and recall as $\mathrm{Recall} = \frac{TP}{TP + FN}$. The F1-score, the harmonic mean of precision and recall, is computed as $\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$. The details of the datasets and the metrics are presented in the supplementary material.
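As a small worked example of these metrics, the sketch below counts TP, FP and FN on a pair of binary masks and derives IoU, precision, recall and F1-score from them. The function name `road_metrics` and the toy masks are illustrative assumptions; returning 0.0 when a denominator is zero is also a convention chosen here, not one stated in the text.

```python
def road_metrics(pred, truth):
    """IoU, precision, recall and F1 of road pixels for two binary H x W masks."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [1, 1, 0]]
metrics = road_metrics(pred, truth)
```

In this example TP = 2, FP = 1 and FN = 1, so the IoU is 2/4 = 0.5 while precision, recall and F1 are all 2/3. True negatives do not enter any of the four metrics.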

Figure 6: The visualization shows how gate value changes w.r.t. one data source being less confident in prediction. The left sample refers to the case where the resolution of an image is reduced gradually, and the right one refers to the case where the localization noises of trajectories are increased. White pixel indicates a large gate value and black pixel indicates a small value.
(a) Predictions of all data fusion models
(b) The gate value computed by GFM in all levels
Figure 7: Visualization of results and gate values under information loss attack. White pixel indicates higher gate value. Pay attention to the areas (in yellow boxes) of gate values which represent the areas attacked by the information loss.

Baselines. We include a large number of competitors, which fall into three categories. The first category is trajectory-based approaches that extract the map by clustering trajectories or GPS points; the representative works TC1 [17], KDE [7] and COBWEB [25] are selected as baselines. The second category is aerial image-based approaches that extract the map using computer vision techniques; DeconvNet [20] (adopted in [10]), SegNet [5] (adopted in [11]), U-Net [23], DeepRoadMapper [18] and RoadTracer [6] are implemented as competitors. We also include a U-Net that takes trajectory data, pre-processed as an image, as input (the trajectory-input U-Net). The last category is fusion-based approaches that take both aerial images and trajectories as inputs. We implement Early-fusion, which concatenates the inputs (using U-Net); L3Fsn, which concatenates the features in the third convolutional block of FCN-8 [22]; Late-fusion and its version with residual correction, SegNet-RC [2]; and FuseNet [13] as well as its advanced version V-FuseNet, applied in (beyondrgb; anotherfusenet; rgbosmfuse). TriSeg [21] and 1D Decoder [28], the only two existing state-of-the-art fusion approaches for the map extraction task, are also implemented as competitors.

Overall Evaluation

We evaluate the performance of the different approaches in extracting the maps of the test regions in three cities. The results are reported in Table 1. First, most trajectory-based approaches do not perform well. This is consistent with our expectation, as they are based on trajectory clustering without any supervised learning procedure. The trajectory-input U-Net performs best among them as it is trained end-to-end. Among the image-based approaches, DeepRoadMapper and RoadTracer generally perform better than the others, as they use specifically designed CNN structures that are more powerful than other VGG-like models. In general, fusion-based approaches outperform the approaches in the above two categories. This demonstrates the power of combining different data sources and confirms that combining data sources that complement each other can significantly improve the overall performance. DeepDualMapper consistently outperforms all the competitors in all three cases. Its performance is also more stable, as its standard deviation tends to be small. To verify the effectiveness of our fusion strategy and to ensure a fair evaluation, all the evaluated approaches use the same input features. Among the other fusion-based approaches, TriSeg, 1D decoder, and V-FuseNet demonstrate certain advantages over the rest, while the performance ranking of the remaining approaches is not clear.

Notice that among the three datasets, the performance on the Porto dataset is the best. The main reason is that the Porto dataset has better data quality than the others. Porto's satellite image is more accurate and clearer than the other two. For trajectory data, the noise and error in the trajectories of Porto are smaller than those of Singapore, and the coverage density is higher than that of Shanghai. Consequently, the overall performance on Porto is the best of all three datasets.

Visualization of the Gate Values

To offer a clearer view of how our fusion model behaves when it encounters two data sources with different prediction confidence, we select a sub-region and gradually reduce the predictability of either modality. We report the level-5 (i.e., last-level) gate values of the aerial image, $G^I_5$, and of the trajectory, $G^T_5$, in Fig. 6 to visualize the changes to the information flow when one data source's quality drops. When the image fades and loses more and more details, it gradually loses its confidence in prediction. Therefore, $G^I_5$ is reduced (becomes darker) while $G^T_5$ is increased (becomes whiter). The model is still able to make a reasonable prediction even when the original aerial image has lost much important information. We can make the same observations in the case where the noise of the trajectory data increases.

Performance w.r.t. Trajectory Data Volume

Figure 8: The IoU of DeepDualMapper (dual modality) and U-Net (single modality) under Porto dataset w.r.t. different trajectory data qualities.

Fig. 8 plots the IoU of the trajectory-input U-Net and DeepDualMapper under different trajectory data volumes. We report only the results on the Porto dataset to save space; similar results are observed on the other two datasets. U-Net is selected for comparison because it is the backbone of our model, so it can be directly compared with DeepDualMapper, which takes multiple data sources as input. The trajectory-input U-Net is vulnerable to the reduction of trajectory data quality, while DeepDualMapper is resilient to it. Although DeepDualMapper is relatively robust, it still shows a minor performance drop, which is reasonable: when one modality (i.e., trajectory data) loses most of its information, DeepDualMapper can only infer from the other modality (i.e., the aerial image), and its performance approaches that of the single-modality U-Net on aerial images.

Evaluation on Robustness to Information Loss

Data fusion approaches are effective only if they can learn how to compensate for the information loss of one modality with the other modalities. As stated before, existing approaches fuse the data sources coarsely, so it remains unknown whether they have really learned how to fuse the data correctly and to maximize the utility of both data sources. For a more thorough study, we conduct an information loss attack on the testing datasets. In detail, we randomly remove the information in 1/4 of the area of the aerial image and in another 1/4 of the area of the trajectory data, to see whether the fusion model can identify the modality that contains the true information and use it for the inference. We conduct the experiment on all three datasets and report the quantitative results in Table 2. In addition, we visualize the maps generated by all data fusion models in Fig. 7(a). Note that quadrant 3 of the aerial image and quadrant 1 of the trajectory data have been removed. Most of the baselines are vulnerable to the information loss attack, while our model is, to some extent, able to withstand it.
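The attack described above can be sketched as a function that blanks one quadrant of a 2-D input. This is only an illustration under two assumptions that the text does not spell out: "removing" information is modelled as setting pixels to zero, and quadrants follow the Cartesian numbering (1 = top-right, 2 = top-left, 3 = bottom-left, 4 = bottom-right). The function name `blank_quadrant` is hypothetical.

```python
def blank_quadrant(grid, quadrant):
    """Return a copy of a 2-D grid with one quadrant set to zero.

    Quadrants are assumed to follow the Cartesian convention:
    1 = top-right, 2 = top-left, 3 = bottom-left, 4 = bottom-right.
    """
    if quadrant not in (1, 2, 3, 4):
        raise ValueError("quadrant must be 1, 2, 3 or 4")
    height, width = len(grid), len(grid[0])
    top = quadrant in (1, 2)
    left = quadrant in (2, 3)
    rows = range(0, height // 2) if top else range(height // 2, height)
    cols = range(0, width // 2) if left else range(width // 2, width)
    out = [list(row) for row in grid]  # copy, so the input stays untouched
    for r in rows:
        for c in cols:
            out[r][c] = 0
    return out


image = [[1] * 4 for _ in range(4)]        # stand-in for one image channel
trajectory = [[1] * 4 for _ in range(4)]   # stand-in for the GPS count map
attacked_image = blank_quadrant(image, 3)
attacked_trajectory = blank_quadrant(trajectory, 1)
```

Blanking different quadrants in the two modalities leaves every region covered by at least one intact source, which is what lets the experiment test whether a fusion model picks the right modality per region.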

To further study how DeepDualMapper fuses the data and why it outperforms the other approaches under an information loss attack, we visualize the gate values computed by GFM from level 1 to level 5 in Fig. 7(b). We can observe that since quadrant 3 of the aerial image has been removed, GFM assigns higher weights to the trajectory data. Similarly, as quadrant 1 of the trajectory data has been removed, GFM detects that the aerial image carries more valuable information for the prediction. Accordingly, it passes the information of aerial image but blocks the trajectory information in quadrant 1. This example demonstrates the effectiveness of our gating mechanism for data fusion. In addition, from the gate values across different levels, we can observe that the gate values are refined from coarse to fine and the gate values of all levels meet the consistency constraint.

Dataset Porto Shanghai Singapore
Early Fusion
Late Fusion
1D decoder
Table 2: The reported IoU under the test set with information loss attack. Note we do not include the attack in training set.

Study on Densely Supervised Refinement

Recall that the DSRD is responsible for generating the predictions. We first evaluate the IoU of the predictions generated from $I'_i$ (image feature), $T'_i$ (trajectory feature), $F_i$ (fused feature) and $F'_i$ (refined feature) via the shared prediction module. The results are shown in Fig. 9(a). Compared with the single-modality features $I'_i$ or $T'_i$, the fused feature $F_i$ contains more precise information. In addition, we observe the improvement achieved by $F'_i$ over $F_i$, which demonstrates the effectiveness of the refinement process. Last but not least, we claim that dense supervision further improves the fusion process, as the prediction made from the fused feature $F_i$ with dense supervision is more accurate than that without it, as reported in Fig. 9(b). For a clearer view, we visualize the predictions computed from $I'_i$, $T'_i$, $F_i$ and $F'_i$ in all 5 levels in Fig. 10. The results show that the refinement procedure smooths and connects the roads, and that the result is generated from coarse to fine, as well as from rough to smooth, over the five levels.

(a) IoU predicted by $I'_i$, $T'_i$, $F_i$ and $F'_i$
(b) IoU of $F_i$ vs. supervision
Figure 9: Quantitative results showing the utility of densely supervised refinement decoding. The IoU of the predictions generated from the features at each stage gradually increases, demonstrating the effectiveness of the fusion step and the refinement step.
Figure 10: Visualizations of the predictions generated from $I'_i$, $T'_i$, $F_i$ and $F'_i$ in all 5 levels. The fused predictions ($F_i$) are better than those of $I'_i$ and $T'_i$. The refined predictions ($F'_i$) further smooth (and connect) the roads predicted by $F_i$. Please pay attention to the differences in the dotted boxes.


Conclusion

In this paper, we have presented an automatic map extraction approach that effectively fuses the information of aerial images with that of GPS trajectories. We further boost the accuracy of the map extraction task by proposing the gated fusion module and the densely supervised refinement decoder. We have demonstrated the effectiveness of our model through comprehensive experimental studies on three city-scale datasets. In addition, we have implemented an information loss attack task. As expected, our model is much more robust to the attack than a wide range of state-of-the-art competitors, and this resilience to information loss is mainly attributable to its carefully designed fusion structure. Since our preprocessing of the trajectory data is still relatively simple, we plan to consider other information such as speed and direction in the future, and to investigate whether more data augmentation could enhance the performance of the model.


Acknowledgments

We thank Zhangqing Shan for providing experimental results of trajectory-based approaches (i.e., TC1, KDE, COBWEB). This research is supported in part by the National Natural Science Foundation of China under grant 61772138, the National Key Research and Development Program of China under grant 2018YFB0505000, and the National Research Foundation, Prime Minister’s Office, Singapore under its International Research Centres in Singapore Funding Initiative.


References

  • [1] R. Alshehhi, P. R. Marpu, W. L. Woon, and M. Dalla Mura (2017) Simultaneous extraction of roads and buildings in remote sensing imagery with convolutional neural networks. ISPRS Journal of Photogrammetry and Remote Sensing 130, pp. 139–149.
  • [2] N. Audebert, B. Le Saux, and S. Lefèvre (2016) Semantic segmentation of earth observation data using multimodal and multi-scale deep networks. In ACCV’16, pp. 180–196.
  • [3] N. Audebert, B. Le Saux, and S. Lefèvre (2017) Joint learning from earth observation and OpenStreetMap data to get faster better semantic maps. In CVPR’17 Workshops, pp. 1552–1560.
  • [4] N. Audebert, B. Le Saux, and S. Lefèvre (2018) Beyond RGB: very high resolution urban remote sensing with multimodal deep networks. ISPRS Journal of Photogrammetry and Remote Sensing 140, pp. 20–32.
  • [5] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis & Machine Intelligence (12), pp. 2481–2495.
  • [6] F. Bastani, S. He, S. Abbar, M. Alizadeh, H. Balakrishnan, S. Chawla, S. Madden, and D. DeWitt (2018) RoadTracer: automatic extraction of road networks from aerial images. In CVPR’18.
  • [7] J. Biagioni and J. Eriksson (2012) Map inference in the face of noise and disparity. In SIGSPATIAL GIS’12, pp. 79–88.
  • [8] L. Cao and J. Krumm (2009) From GPS traces to a routable road map. In SIGSPATIAL GIS’09, pp. 3–12.
  • [9] R. Cao, J. Zhu, W. Tu, Q. Li, J. Cao, B. Liu, Q. Zhang, and G. Qiu (2018) Integrating aerial and street view images for urban land use classification. Remote Sensing 10 (10), pp. 1553.
  • [10] G. Cheng, Y. Wang, S. Xu, H. Wang, S. Xiang, and C. Pan (2017) Automatic road detection and centerline extraction via cascaded end-to-end convolutional neural network. IEEE Transactions on Geoscience and Remote Sensing 55 (6), pp. 3322–3337.
  • [11] I. Demir, F. Hughes, A. Raj, K. Tsourides, D. Ravichandran, S. Murthy, K. Dhruv, S. Garg, J. Malhotra, B. Doo, et al. (2017) Robocodes: towards generative street addresses from satellite imagery. In CVPR’17 Workshops, pp. 1486–1495.
  • [12] J. Fil, T. G. Rudner, M. Russwurm, B. Bischke, R. Pelich, V. Kopackova, and P. Bilinski (2018) Multi3Net: segmenting flooded buildings via fusion of multiresolution, multisensor, and multitemporal satellite imagery. arXiv preprint arXiv:1812.01756.
  • [13] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers (2016) FuseNet: incorporating depth into semantic segmentation via fusion-based CNN architecture. In ACCV’16, pp. 213–228.
  • [14] M. A. Islam, S. Naha, M. Rochan, N. Bruce, and Y. Wang (2017) Label refinement network for coarse-to-fine semantic segmentation. arXiv preprint arXiv:1703.00551.
  • [15] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [16] X. Liu, J. Biagioni, J. Eriksson, Y. Wang, G. Forman, and Y. Zhu (2012) Mining large-scale, sparse GPS traces for map inference: comparison of approaches. In KDD’12, pp. 669–677.
  • [17] X. Liu, Y. Zhu, Y. Wang, G. Forman, L. M. Ni, Y. Fang, and M. Li (2012) Road recognition using coarse-grained vehicular traces. HP Labs.
  • [18] G. Máttyus, W. Luo, and R. Urtasun (2017) DeepRoadMapper: extracting road topology from aerial images. In ICCV’17, pp. 3458–3466.
  • [19] V. Mnih and G. E. Hinton (2010) Learning to detect roads in high-resolution aerial images. In ECCV’10, pp. 210–223.
  • [20] H. Noh, S. Hong, and B. Han (2015) Learning deconvolution network for semantic segmentation. In ICCV’15, pp. 1520–1528.
  • [21] B. Parajuli, P. Kumar, T. Mukherjee, E. Pasiliao, and S. Jambawalikar (2018) Fusion of aerial lidar and images for road segmentation with deep CNN. In SIGSPATIAL GIS’18, pp. 548–551.
  • [22] S. Piramanayagam, E. Saber, W. Schwartzkopf, and F. Koehler (2018) Supervised classification of multisensor remotely sensed images using a deep learning framework. Remote Sensing 10 (9), pp. 1429.
  • [23] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI’15, pp. 234–241.
  • [24] S. Schroedl, K. Wagstaff, S. Rogers, P. Langley, and C. Wilson (2004) Mining GPS traces for map refinement. Data Mining and Knowledge Discovery 9 (1), pp. 59–87.
  • [25] Z. Shan, H. Wu, W. Sun, and B. Zheng (2015) COBWEB: a robust map update system using GPS trajectories. In UbiComp’15, pp. 927–937.
  • [26] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [27] T. Sun, Z. Chen, W. Yang, and Y. Wang (2018) Stacked U-Nets with multi-output for road extraction. In CVPR’18 Workshop, pp. 202–206.
  • [28] T. Sun, Z. Di, P. Che, C. Liu, and Y. Wang (2019) Leveraging crowdsourced GPS data for road extraction from aerial imagery. In CVPR’19, pp. 7509–7518.
  • [29] L. Zhou, C. Zhang, and M. Wu (2018) D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In CVPR’18 Workshop, pp. 182–186.

Appendix A Qualitative Results

We first visualize the generated maps produced by all fusion-based methods introduced in the main text. The generated area has a size of 896m × 896m and is synthesized from sub-regions of size 224 × 224. To avoid edge artifacts caused by convolving over padded borders when concatenating the sub-regions, we first extend each sub-region to 448 × 448 and then crop the central 224 × 224 region as the prediction of that sub-region. In this way, the convolutional filters process real local context at the edges rather than padding. The visualization results are presented in Fig. 11. To offer a more detailed view of the quality of the generated map, we visualize the whole test area generated by DeepDualMapper on the Porto dataset as a representative. The result is shown in Fig. 12.
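The extend-then-crop procedure above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names `context_windows` and `center_crop` are hypothetical, and the sketch assumes the imagery surrounding the area is available, since the extended windows of border sub-regions reach outside it.

```python
def context_windows(area_size, tile=224, context=448):
    """Return (tile_box, context_box) pairs covering a square area.

    Boxes are (top, left, bottom, right) in pixels. A context box may extend
    beyond the area, so the caller has to supply the surrounding imagery.
    """
    margin = (context - tile) // 2
    pairs = []
    for top in range(0, area_size, tile):
        for left in range(0, area_size, tile):
            tile_box = (top, left, top + tile, left + tile)
            context_box = (top - margin, left - margin,
                           top + tile + margin, left + tile + margin)
            pairs.append((tile_box, context_box))
    return pairs


def center_crop(grid, size):
    """Keep the central size x size block of a 2D list."""
    top = (len(grid) - size) // 2
    left = (len(grid[0]) - size) // 2
    return [row[left:left + size] for row in grid[top:top + size]]


pairs = context_windows(896)
# Stand-in for a 448 x 448 prediction: each cell stores its own coordinates.
prediction = [[(r, c) for c in range(448)] for r in range(448)]
kept = center_crop(prediction, 224)
```

With these sizes the area splits into 16 sub-regions, and each prediction keeps only the pixels that had 112 pixels of real context on every side.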

(a) Aerial Image
(b) Trajectory
(c) DeepDualMapper
(d) Ground Truth
(e) DeepDualMapper (IoU 0.79)
(f) TriSeg (IoU 0.755)
(g) 1D decoder (IoU 0.719)
(h) V-Fusenet (IoU 0.722)
(i) Early-fusion (IoU 0.714)
(j) Late-fusion (IoU 0.747)
(k) SegNet-RC (IoU 0.728)
(l) L3Fsn (IoU 0.708)
Figure 11: Examples showing the generated map in an area of size 900m × 900m. The first row shows the aerial image and the trajectories, as well as the predictions from DeepDualMapper and the ground truth constructed from OpenStreetMap. The second and third rows visualize the predictions of all fusion-based methods. The red pixels indicate road pixels that are wrongly predicted as non-roads, i.e., false negatives, and the blue pixels represent non-road pixels that are wrongly predicted as roads with respect to the ground truth, i.e., false positives.
Figure 12: Visualization of the map generated by DeepDualMapper for the whole test area in the Porto dataset. As before, the red pixels indicate road pixels wrongly predicted as non-roads (false negatives) and the blue pixels represent non-road pixels wrongly predicted as roads (false positives). Note that a large share of the false positive samples (in blue) actually results from errors in OpenStreetMap (which serves as the ground truth): these roads exist but are missing in OpenStreetMap. This demonstrates that DeepDualMapper can detect some roads missing from OpenStreetMap and can help make it more precise.

Appendix B Implementation Details

Data Preprocessing

As introduced in the main text, we get a training sample by cropping an area of size 224 × 224. In the following, we detail the procedure of constructing the training examples. An intuitive way to generate the training samples is to cut the whole city into disjoint grids, like a mesh. We do not adopt the mesh representation because it cannot preserve the information at grid boundaries well. Instead, we randomly generate samples of size 224 × 224 from the whole city, excluding the testing and validation regions. The boundary of one sample can lie in the center of another sample, which effectively addresses the limitation of the mesh representation. Our training set is generated in the same way, consisting of randomly generated training samples of fixed size. To guarantee fairness, we apply the same sampling strategy to all neural network-based baselines. We normalize the RGB channels of the aerial image before feeding them to the network. We use a sample size of 224 × 224 pixels at 1m/pixel for implementation, which is the most common input size for VGG-like models. However, the size is not fixed: the model is a fully convolutional network and hence can take an image of any size as input.
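The random sampling strategy can be illustrated with a small rejection sampler. This is a sketch under stated assumptions, not the authors' data loader: the names `overlaps` and `sample_training_box` are hypothetical, held-out regions are modeled as axis-aligned boxes, and "excluding" is interpreted as rejecting any patch that shares a pixel with a validation or testing region.

```python
import random


def overlaps(a, b):
    """True if two (top, left, bottom, right) boxes share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sample_training_box(rng, city_h, city_w, held_out, size=224, max_tries=10000):
    """Draw one random size x size patch that avoids all held-out regions."""
    for _ in range(max_tries):
        top = rng.randint(0, city_h - size)    # randint is inclusive on both ends
        left = rng.randint(0, city_w - size)
        box = (top, left, top + size, left + size)
        if not any(overlaps(box, region) for region in held_out):
            return box
    raise RuntimeError("no valid patch found; held-out regions may cover the city")


rng = random.Random(0)
# Illustrative validation/testing boxes inside a 2000 x 2000 pixel city.
held_out = [(0, 0, 500, 500), (1000, 1000, 1600, 1600)]
boxes = [sample_training_box(rng, 2000, 2000, held_out) for _ in range(100)]
```

Because patch origins are unconstrained, the boundary of one patch regularly falls in the interior of another, which is the property the mesh layout lacks.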

For trajectory feature extraction, as stated in the paper, we partition the region into many equally sized grids and count the number of trajectory points falling in each grid. We have tried several grid sizes, e.g., 1m, 2m, 4m, and 8m, and use the one that performs best.
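The grid counting step can be sketched as a simple 2D histogram. The helper name `count_points`, the `cell_size` parameter, and the assumption that trajectory points are already projected to metric coordinates are illustrative choices, not details taken from the paper.

```python
def count_points(points, origin, cell_size, rows, cols):
    """Count trajectory points per grid cell.

    points: iterable of (x, y) positions in metres (assumed already projected).
    origin: (x0, y0) of the grid corner; cell_size: edge length of a cell in metres.
    Points outside the grid are ignored.
    """
    x0, y0 = origin
    grid = [[0] * cols for _ in range(rows)]
    for x, y in points:
        col = int((x - x0) // cell_size)
        row = int((y - y0) // cell_size)
        if 0 <= row < rows and 0 <= col < cols:
            grid[row][col] += 1
    return grid


points = [(0.2, 0.7), (0.9, 0.1), (2.5, 1.5), (-1.0, 0.0), (3.99, 3.99)]
grid = count_points(points, origin=(0.0, 0.0), cell_size=1.0, rows=4, cols=4)
```

The resulting count map has the same layout as an image channel, so it can be fed to a convolutional encoder alongside the aerial image.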

As the training patches are randomly cropped from the whole area, it is hard to define an “epoch”, since two training samples might overlap. As a solution, we introduce a sample count determined by the area of the training region. This count is the expected number of samples needed to cover the entire training region, and an epoch refers to one scan over that many samples.

Model Details

Structure of U-Net.

DeepDualMapper is designed based on U-Net. U-Net is a fully convolutional network for pixel-wise prediction through encoding and decoding stages. In the encoding stage, the spatial size of the feature maps is halved four times through max pooling, while the number of channels is doubled four times, which is similar to the first 8 layers of the VGG network (type B) [26]. The decoding stage up-samples the low-resolution encoded feature map to a prediction map at the original resolution through four deconvolutions. One important feature of U-Net is the skip connection from each feature map extracted in the encoding stage to the corresponding layer in the decoding stage, so that lower-level features, such as shapes and edges, are directly available in the decoding stage. Moreover, the shortcut also shortens the gradient path, which alleviates vanishing gradients in deep neural networks.

As the map extraction task (pixel-wise binary prediction) is expected to be no harder than the semantic segmentation task (pixel-wise multi-class classification), we reduce the number of parameters of the original U-Net for faster training and lower memory use. In detail, we divide the output channels of all convolutional layers by 4. Note that we have conducted experiments to ensure that this modification does NOT harm the performance of DeepDualMapper. To guarantee a fair comparison, we also divide the channels of the convolutions of VGG-like models, e.g., SegNet, DeconvNet, and U-Net, by 4. The configurations of all layers of our modified U-Net are listed in Table 3.

name              kernel size  stride  pad  output size
image input       -            -       -
traj input        -            -       -
conv1-1                        1       1
conv1-2                        1       1
max-pool1                      2       0
conv2-1                        1       1
conv2-2                        1       1
max-pool2                      2       0
conv3-1                        1       1
conv3-2                        1       1
max-pool3                      2       0
conv4-1                        1       1
conv4-2                        1       1
max-pool4                      2       0
conv5-1                        1       1
conv5-2                        1       1
deconv4-1                      2       0
concat conv4-2    -            -       -
conv4-3                        1       1
conv4-4                        1       1
deconv3-1                      2       0
concat conv3-2    -            -       -
conv3-3                        1       1
conv3-4                        1       1
deconv2-1                      2       0
concat conv2-2    -            -       -
conv2-3                        1       1
conv2-4                        1       1
deconv1-1                      2       0
concat conv1-2    -            -       -
conv1-3                        1       1
conv1-4                        1       1
fc                             1       0
softmax           -            -       -
Table 3: Detailed configurations of the encoder and the decoder branches of our model. “conv” and “deconv” refer to the convolution layer and the de-convolution layer respectively. Note that all “conv” layers are followed by a batch normalization layer and a ReLU activation. “concat X” denotes the concatenation of the outputs of the previous layer and of layer X.
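As a rough cross-check of the encoder description (spatial size halved four times, channels doubled four times, all channel counts divided by 4), the per-level feature map shapes can be derived as follows. This is a sketch under assumptions: it takes the 224 × 224 input from Appendix B and the 64-channel first level of the original U-Net, and the function name `unet_encoder_shapes` is hypothetical. The values are derived from the text, not copied from Table 3.

```python
def unet_encoder_shapes(input_size=224, base_channels=64 // 4, levels=5):
    """Return (level, spatial_size, channels) for each encoder level.

    base_channels assumes the original U-Net width of 64 divided by 4.
    """
    shapes = []
    size, channels = input_size, base_channels
    for level in range(1, levels + 1):
        shapes.append((level, size, channels))
        size //= 2       # 2x2 max pooling with stride 2 halves the resolution
        channels *= 2    # the next level doubles the channel count
    return shapes


shapes = unet_encoder_shapes()
```

Under these assumptions the coarsest level works on 14 × 14 feature maps with 256 channels.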

Gated Fusion Module.

Recall that the decoder phase of DeepDualMapper has 5 levels and adopts the GFM at every level. Specifically, we assign individual weights to the GFM of each level, i.e., the levels do not share GFM parameters. Two of the GFM convolutions do not change the channel dimension, while a third convolution maps the channel dimension to 2 so that softmax normalization can be performed. The non-parametric upsampling operation is implemented by nearest-neighbor upsampling. The initial gate value is set to a constant that serves as a non-informative prior.
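To make the two-channel softmax gate concrete, the following sketch shows the normalization for a single spatial position. It is only an illustration: the two logits stand in for the output of the gate convolution, the function names are hypothetical, and applying the gate as a convex combination of the two modality features is an assumption made here rather than a formula quoted from this appendix.

```python
import math


def gate_weights(logit_image, logit_traj):
    """Two-way softmax: returns (g_image, g_traj), which always sum to 1."""
    m = max(logit_image, logit_traj)      # subtract the max for numerical stability
    e_img = math.exp(logit_image - m)
    e_traj = math.exp(logit_traj - m)
    total = e_img + e_traj
    return e_img / total, e_traj / total


def fuse(feat_image, feat_traj, logit_image, logit_traj):
    """Assumed fusion rule: gate-weighted sum of the two feature vectors."""
    g_img, g_traj = gate_weights(logit_image, logit_traj)
    return [g_img * a + g_traj * b for a, b in zip(feat_image, feat_traj)]


balanced = fuse([2.0, 4.0], [0.0, 0.0], 0.0, 0.0)
image_dominant = gate_weights(10.0, -10.0)
```

Equal logits give both modalities a weight of 0.5, while a large logit gap passes one modality almost unchanged and blocks the other.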

Densely Supervised Refinement Decoding.

Our dense supervision needs labels at all scales, i.e., at all five levels. The label at the finest level is the original ground truth. To get the ground truth of a coarser level, we apply average pooling to the ground truth of the preceding finer level. The label is therefore a real value in the range [0, 1], which can be interpreted as the proportion of road pixels in the corresponding area, and the cross-entropy loss is compatible with such labels.
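The multi-scale label generation can be sketched as repeated average pooling. This is a minimal illustration, assuming 2 × 2 pooling with stride 2 between consecutive levels (matching the halving of feature maps in the encoder); the helper names are hypothetical.

```python
import math


def avg_pool2(label):
    """2x2 average pooling with stride 2 on a 2D list (even height and width)."""
    return [
        [
            (label[r][c] + label[r][c + 1]
             + label[r + 1][c] + label[r + 1][c + 1]) / 4.0
            for c in range(0, len(label[0]), 2)
        ]
        for r in range(0, len(label), 2)
    ]


def label_pyramid(label, levels=5):
    """Level 1 is the full-resolution mask; each further level is pooled once more."""
    pyramid = [label]
    for _ in range(levels - 1):
        pyramid.append(avg_pool2(pyramid[-1]))
    return pyramid


def soft_cross_entropy(prob, target, eps=1e-7):
    """Binary cross-entropy that accepts a real-valued target in [0, 1]."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


# A 16 x 16 binary mask containing a 2-pixel-wide vertical road.
mask = [[1.0 if c in (6, 7) else 0.0 for c in range(16)] for r in range(16)]
pyramid = label_pyramid(mask, levels=5)
```

Each pooled value is the fraction of road pixels in its receptive area, and the soft cross-entropy is minimized when the predicted probability equals that fraction.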

Output Format.

There are mainly two types of output format for the map extraction task, i.e., the graph representation and the binary image representation. The former represents the map as a directed graph and is usually used by trajectory-based approaches (cobweb; tc1; kde; trajsift). It directly captures the topology of the road network but is hard to optimize end-to-end. The latter, usually used by aerial image-based approaches (hinton_mapinference; casnet; robocodes; triseg; 1ddecoder; dlinknet; stackedunet), is easy to cast as a pixel-wise binary prediction task but loses the topological information.

In this paper, we use the binary image representation as the output format. Even if the graph representation is necessary for certain applications, it can be generated from the binary images: approaches such as (deeproadmapper; roadtracer) have recently been proposed to predict the graph representation by post-processing a first-stage binary image prediction. Moreover, the recent DeepGlobe road extraction challenge also selected this image-based output format.

Appendix C Experiment Details


Notice that the map extraction task studied in this paper leverages both aerial images and GPS trajectories, so we are not able to directly adopt existing datasets for evaluation. We retrieve the aerial images from the Google Maps API (zoom=17), whose original resolution is roughly 1m/pixel, and we resize them to exactly 1m/pixel for simplicity. For the GPS trajectory datasets, all trajectories record the journeys of taxis. The Porto dataset is an open-source dataset. It contains trajectories generated by 442 taxis from 2013 to 2014. The Shanghai dataset has trajectories generated by around 13,000 taxis over 3 days in 2015. The Singapore dataset is generated by about 15,000 taxis over 60 days in 2012. Table 4 lists the detailed statistics of all datasets. Fig. 13 visualizes the training, validation and testing areas. The testing area (in red boxes) is selected such that it contains both dense and sparse roads, in order to test the performance of DeepDualMapper under different scenarios.

Ground Truth.

Since the datasets used in this paper are not public datasets whose roads have already been annotated, the ground truth is not available. To efficiently build the ground truth, we leverage the road networks from OpenStreetMap. The latest version of the maps is downloaded to ensure the completeness of the road networks. We exclude the roads belonging to the service type. Recall that in this paper, we regard map extraction as a pixel-wise binary classification task. Thus, we need to convert the road network into an image. In detail, we draw roads as lines with a width of 10 pixels to generate the image representation of the ground truth. We select 10 pixels because it corresponds to 10 meters in the real world (as the aerial image has a resolution of 1m/pixel), which is the average road width. We have also tested DeepDualMapper with other widths, and the width setting does not affect the performance much. One may argue that our assumption that all roads share the same width is inconsistent with reality. We would like to highlight that this is not a concern, as we have found that ALL neural network-based models can learn such a fixed width automatically without much effort, even though it is not consistent with the real width.

Dataset Porto Shanghai Singapore
Width (km) 15.447 19.500 16.000
Height (km) 13.538 14.500 12.000
Validation area percentage 3.486% 3.183% 2.083%
Test area percentage 9.998% 12.439% 13.021%
# Trajectories (k) 1,692 1,105 952
# Trajectory pts (million) 77 95 584
# Trajectory pts per 1m × 1m grid 0.367 0.335 3.043
# Trajectory pts per valid 1m × 1m grid 6.461 3.616 13.585
Trajectory sampling interval (s) 15.012 9.391 25.135
# Roads (k) 43 21 37
# Roads/km² 207.1 77.6 195.2
Original image resolution (m/px) 0.90 1.02 1.19
Table 4: The statistics of three datasets under the areas shown in Fig. 13.


As introduced in the main text, we adopt IoU and F1-score as the evaluation metrics. In detail, we denote correctly predicted road pixels (true positives) as TP, correctly predicted non-road pixels (true negatives) as TN, road pixels wrongly predicted as non-roads (false negatives) as FN, and non-road pixels wrongly predicted as roads (false positives) as FP. The IoU is computed as $\mathrm{IoU} = \frac{TP}{TP + FP + FN}$. Precision is computed as $P = \frac{TP}{TP + FP}$ and recall is computed as $R = \frac{TP}{TP + FN}$. The F1-score, the harmonic mean of precision and recall, is computed as $F_1 = \frac{2PR}{P + R}$. Note that the numbers of TP, FP, TN, and FN pixels are counted over the whole testing area.
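The metric definitions translate directly into code. The sketch below uses hypothetical helper names and treats any nonzero pixel as road; returning 0.0 when a denominator is zero is a convention chosen here, not something specified in the paper.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, TN, FN over two equally sized binary masks (2D lists)."""
    tp = fp = tn = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def road_metrics(tp, fp, fn):
    """IoU, precision, recall and F1 from pixel counts (0.0 on empty denominators)."""
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


pred = [[1, 1, 0, 0],
        [1, 0, 0, 0]]
truth = [[1, 0, 0, 0],
         [1, 1, 0, 0]]
tp, fp, tn, fn = confusion_counts(pred, truth)
metrics = road_metrics(tp, fp, fn)
```

To follow the protocol in the text, the four counts would be accumulated over every patch of the testing area before the ratios are computed once, rather than averaging per-patch scores.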

(a) Porto
(b) Shanghai
(c) Singapore
Figure 13: Visualization of three datasets. The regions inside the green and red boxes refer to the validation area and the testing area respectively, while the remaining regions form the training set.

Training Details

This paper studies data fusion methods on aerial images and GPS trajectories for map extraction. Accordingly, we want to minimize the influence of other factors. To do so, we use the same data loader to provide training and testing samples for all neural network-based models. In detail, as introduced previously, in each step we randomly crop a 224 × 224 area for training for all approaches. We also adopt the same learning rate (1e-4) and optimizer (Adam) for all the models. Notice that, as the input contains not only an RGB image but also 1-channel trajectory data, it is hard to leverage pre-trained weights (e.g., from ImageNet) to initialize the convolution kernels. Fortunately, we found that all the models converge fast even when trained from scratch, so we randomly initialize all the parameters of all the models.

There are some fluctuations in the performance of some approaches. As a solution, we train all the approaches for 50 epochs (which is enough for them to converge) and report the average results of the last 10 epochs. Note that DeepRoadMapper and RoadTracer adopt their own segmentation networks, and we observe that their performance converges more slowly than that of the other baselines. Consequently, we train them for 100 epochs and report the average performance of the last 10 epochs. Moreover, as the original DeepRoadMapper and RoadTracer both first produce the image representation of the map and then post-process it into the graph representation, we only include their segmentation networks to produce the map in image representation for a fair comparison.

We implement our code in PyTorch 0.4.1 under Ubuntu 16.04. The CPU is an Intel Core i7-6850K with 128GB of memory. All the models fit in an Nvidia GTX 1080Ti GPU with 11GB of memory. Our model takes about 7.6ms to infer a 224 × 224 patch (with batch size 16).