Road extraction is an essential step in building autonomous navigation systems. Detecting road segments is challenging as they are of varying widths, bifurcated throughout the image, and are often occluded by terrain, cloud, or other weather conditions. Using just convolution neural networks (ConvNets) for this problem is not effective as it is inefficient at capturing distant dependencies between road segments in the image which is essential to extract road connectivity. To this end, we propose a Spatial and Interaction Space Graph Reasoning (SPIN) module which when plugged into a ConvNet performs reasoning over graphs constructed on spatial and interaction spaces projected from the feature maps. Reasoning over spatial space extracts dependencies between different spatial regions and other contextual information. Reasoning over a projected interaction space helps in appropriate delineation of roads from other topographies present in the image. Thus, SPIN extracts long-range dependencies between road segments and effectively delineates roads from other semantics. We also introduce a SPIN pyramid which performs SPIN graph reasoning across multiple scales to extract multi-scale features. We propose a network based on stacked hourglass modules and SPIN pyramid for road segmentation which achieves better performance compared to existing methods. Moreover, our method is computationally efficient and significantly boosts the convergence speed during training, making it feasible for applying on large-scale high-resolution aerial images. Code available at: https://github.com/wgcban/SPIN_RoadMapper.git.READ FULL TEXT VIEW PDF
Among all the topographic objects found in aerial images, road is one of the essential topographic features with numerous applications ranging from automatic navigation and guidance systems. Extraction of roads from aerial images helps to understand the connectivity between places and thus aid in automating navigation, disaster mitigation, and controlling traffic. Furthermore, road detection helps to determine the drivable areas for autonomous vehicles so that motion planning algorithms can be constrained on drivable roads. In addition, most of the algorithms designed for road boundary extraction and curb detection are based on road segmentation maps as the primary step [27, 25]. The extraction of road boundaries and curbs can be used to further improve the safety of autonomous driving [41, 42].
Classical methods for road segmentation involve geometric-stochastic models [4, 36], line network extraction , and snakes . There have also been works that consider the problem of road extraction as a problem of graph extraction from images [16, 14]19, 13], techniques involving ConvNets have been explored for automatic road extraction [25, 26, 10, 5]
. These works pose road extraction as a problem of semantic segmentation where one tries to classify the pixels corresponding to the road from other semantics of the image.
Segmenting roads from aerial images is not a straight-forward segmentation problem because roads appear at different scales in the image due to varying widths and certain road regions are often narrow and get occluded with respect to the terrain. Also, there exists some similarity of the road texture with respect to nearby regions and there are chances of occlusion due to clouds and various weather conditions. One major problem of using ConvNets directly for road segmentation is that they are not good at learning long-range dependencies due to their inherent inductive biases. In aerial images, the road structure is mostly branched throughout the image as road is a connected topography. Also, just using a ConvNet does not constrain the network to learn representations of connected road segments . These issues make road segmentation from aerial images an open and challenging problem.
In this work, we focus on improving road segmentation by incorporating a global understanding of the image. Modeling dependencies and relations over regions in the image can help in understanding connectivity between the road segments. We note that transformer-based methods  are currently becoming popular for their property of extracting long-range dependencies. However, it is not feasible for applications on large-scale high-resolution remote sensing datasets as it requires high compute power and significant training time. Thus, we propose using graph reasoning rather than just relying only on stacked convolutions or transformers to model global dependencies. Performing graph reasoning is light-weight and does not add on much to computation cost like transformers.
A graph convolution  can extract dependencies over distant regions making it more meaningful for using it to understand road information on a global scale in aerial images. Graph convolutions have been explored for video recognition , semantic segmentation  and semi-supervised classification . Unlike these works, we propose performing graph reasoning in two domains - spatial and interaction space. In graph reasoning over spatial space, we build a graph over the feature space to extract dependencies between different spatial regions in the input. As we operate on the original coordinate space, performing reasoning over the graph would help to extract rich contextual information for road segmentation. For graph reasoning over interaction space, we construct a new interaction space where we model semantics with similar information together. This causes different semantic objects of the aerial image like roads, buildings, clouds, trees, and other topographic features to be modeled into different spaces. Performing graph reasoning over this interaction space would help in appropriate delineation of roads from other topographies in the image. Combining both, we propose a stand alone Spatial and Interaction space (SPIN) graph reasoning module which performs reasoning in the spatial and interaction space projected from the feature maps. Fig 1 illustrates how SPIN module helps to make road segmentation better.
SPIN extracts long range dependencies between road segments and is effective at delineating roads from other semantics present in the image. When added to a base network, we show that it improves the segmentation performance by a reasonable amount. It has numerous other advantages as well. SPIN can be plugged easily into a ConvNet architecture after a convolutional block. As SPIN learns highly contextual information, it increases the convergence rate of the network by half saving a lot of training time. This property is highly useful for training ConvNets on large-scale high-resolution images like aerial images. Adding SPIN to a ConvNet is also computationally effective as it adds on only
parameters. Our proposed network consists of a feature extractor using residual blocks, stacked hourglass modules with skip connections for deep feature extraction and SPIN pyramid for graph reasoning. We analyze the effectiveness of our proposed method by conducting experiments on two large-scale road segmentation datasets - DeepGlobe and Massachusetts Road  where we achieve a better performance than existing methods in the literature.
In summary, this paper makes the following contributions:
We propose a new module - Spatial and Interaction Space Graph Reasoning (SPIN), which when plugged into a ConvNet performs reasoning over graphs constructed on spatial and interaction space projected from the feature maps.
We propose a new network built using stacked hourglass modules and SPIN pyramid for road segmentation from aerial images.
We conduct extensive experiments on large-scale road segmentation datasets where we achieve better performance than existing methods both qualitatively and quantitatively.
Our SPIN module is highly computationally efficient and helps in fast network convergence which makes training ConvNets on high-resolution aerial images quick and effective.
Road segmentation: Road segmentation is a well-studied problem in which we classify each pixel in a given aerial image as “road” or “no road” . Early research on road segmentation primarily relied on probabilistic models to enhance connectivity by combining contextual prior conditions such as road geometry [34, 33] and color intensity . In 39], a model based on high-order conditional random fields (CRF) was used to incorporate prior knowledge of roads. However, these traditional probabilistic methods require hand-designed features and complex optimization techniques .
One of the earliest attempts to automatically learn features for detecting roads in aerial images using expert labeled data was proposed in 
. In this study, unsupervised learning methods such as Principal Component Analysis (PCA) was used to initialize the feature detectors. Later, with the introduction of ConvNets in deep learning, researchers have investigated various ConvNet architectures to efficiently extract roads from aerial images. Among those, encoder-decoder based architectures are widely used due to its ability to capture relatively large spatial context [1, 5]. Examples of these include U-Net , LinkNet , ResNet18  and multi-branch ConvNets [6, 31]
. In addition to the architectural changes, researchers also investigated different types of loss functions to replace well-known binary cross-entropy loss (BCE) to further improve the quality of road proposals and to incorporate topological constraints. In, a differentiable IoU loss function was proposed and most of the later work on road segmentation then used it instead of the BCE loss or combined them together for obtaining improved performance. Instead of just formulating the road extraction as a binary segmentation task,  introduced a multi-task learning  approach where both segmentation and orientation of road line segments are used to improve the connectivity of the predicted road networks.
The proposed SPIN module is shown in Figure 2-(a). We build graphs in two spaces: spatial space and a projected latent interaction space from input feature maps. Then, graph reasoning is performed in spatial space to improve the connectivity between road segments and interaction space to delineate roads from other topographies. Assuming that spatial and interaction space graph reasoning provide different feature representations, we concatenate the output feature maps of both graph reasoning modules, as shown in Figure 2-(a), to extract rich global contextual information of road segments. In order to capture multi-scale context of input feature maps, we build SPIN pyramid by performing SPIN graph reasoning on different scales and then aggregate them as shown in Figure 2-(b).
In what follows, we elaborate on each block of the SPIN module in detail. Before that, we briefly review graph reasoning.
Graph reasoning: A graph is defined by its nodes , edges and similarity matrix that describes the similarity between each and every pixel (node) in the graph. Let denote the input feature map where is the number of channels and . Here, and correspond to the width and height of . Standard 2D convolutions only share information among the positions in a small neighborhood defined by the filter size. In order to achieve a large receptive field and to capture long-range dependencies among the pixels, ConvNet architectures stack multiple convolution layers which is highly inefficient. In contrast, a single graph convolution layer can extract long-range dependencies of input feature map very efficiently and effectively. Formally, the graph convolution is defined as ,
where is the learnable weight matrix (usually modeled as a convolutional layer),and are the same as defined above. Note that the only difference between graph convolution and conventional convolution is that in graph convolution, we left-multiply the original feature map by the similarity matrix before doing the convolution operation.
With this background of graph reasoning, we now describe the two main building blocks of our proposed SPIN module: (1) Spatial space graph reasoning and, (2) Interaction space graph reasoning.
The overall procedure of spatial space graph reasoning is depicted in the red box in Figure 2-(a). As described in the previous section, the main intuition behind spatial space graph reasoning is to improve the connectivity between the predicted road segments. We first build a fully-connected graph in the spatial domain using the spatial similarity matrix and then perform spatial graph reasoning. We now describe the computation procedure of spatial graph reasoning in detail.
Computation of spatial similarity matrix : The first step of spatial graph reasoning is to compute the spatial similarity matrix . There are different similarity metrics that have been proposed in the literature to calculate the similarity between two given pixels. The most popular are the Euclidean distance and the dot product. In our implementation, we use the dot product to compute the similarity matrix .
The similarity matrix for an input feature map can be represented as a multiplication of three transformations as follows:
is a linear transformation followed by ReLU non-linearity andis the diagonal matrix. Note that is the dimension of the intermediate feature map.
In this implementation, the linear transformation is modeled using a convolution layer that reduces input feature map dimension from to . The transformation is represented by a global average pooling followed by a convolution. Then we reshape the outputs and appropriately to perform matrix multiplication as shown in Figure 2-(a) to obtain the similarity matrix .
Graph reasoning in spatial space: Once we have the similarity matrix , we can perform the spatial graph reasoning on input data according to the Eq. (1). First, we reshape the input data appropriately and then we perform the matrix multiplication to obtain . Next, we multiply it by the trainable weight matrix that is modeled as a convolution layer as shown in Figure 2-(a). Finally, we apply ReLU to obtain the spatial graph reasoned feature matrix .
The overall procedure of interaction space graph reasoning is shown in the green box in Figure 2-(a). As we described earlier, the spatial space graph reasoning can improve the connectivity between predicted road segments. We now consider projecting the input feature space into another latent space, called the interaction space , where we try to delineate roads from other objects such as buildings, trees, vehicles, etc. Next, we build a graph that connects these features in the interaction space and performs a relational reasoning on the graph. After reasoning, the updated information is projected back to the original coordinate space. In what follows, we discuss these operations in detail.
Projection to interaction space: The first step is to project the original feature map to the interaction space . This is done by the projection function such that the features in the interaction space are more friendly for global reasoning over disjoint and distant regions, where is the number of nodes and is the number of states.
In practice, we first reduce the dimension of the input feature with the transformation and formulate the projection function as a linear combination of input such that the new features can aggregate information from multiple regions. Concretely, the input feature is projected as in the interaction space through the projection function as follows:
We implement both functions and as convolutional layer as shown in Figure 2-(a).
Graph reasoning in interaction space: After projecting the input feature space into interaction space, we build a fully-connected graph in the interaction space with the node similarity matrix . The similarity matrix
is randomly initialized and learned during back propagation in contrast to the similarity matrix we defined for the spatial graph reasoning that is dependent on the input data. In addition, we use skip connection (i.e. identity matrix) that speeds up the optimization. Following Eq. (1), the graph convolution in the interaction space is formulated as:
where is the trainable weight matrix. Here both matrices and are implemented as 1D convolution with kernel size of 1 as shown in Figure 2-(a).
Reverse projection to the original coordinate space: After graph reasoning in the interaction space, we project the output features to the original coordinate as:
We use the same projection matrix to transform features to . Then we perform linear projection using a convolution layer to transform into the original coordinate space. As a result we have the features with feature dimension at the original coordinate space.
Once we have the spatial and interaction space graph reasoning outputs, we combine them with the original input feature map and then apply ReLU non-linearity to get the final graph reasoned feature map . Mathematically, we can denote this as,
We perform our SPIN graph reasoning at multiple scales to further increase the overall receptive field of the network and to improve long-range contextual information present in the intermediate feature maps. Concretely, we perform SPIN graph reasoning at different scales () of the original feature map as shown in Figure 2-(b). In the results and discussion section, we conduct an ablation study to demonstrate the effect of spatial, interaction, and SPIN graph reasoning on the segmentation performance.
Feature extractor block: Operating the network at high-resolution (i.e. ) requires a large GPU memory and computational power. Therefore, we employ a. We then add two subsequent residual modules before sending it to the hourglass module.
where denotes the scale having values , and are the predicted and ground-truth segmentation maps at scale , respectively. Similarly, we calculate the orientation loss at multiple scales. The orientation loss is defined as follows,
where is the number of bins in the quantized orientation, and are the predicted and ground-truth orientation maps of orientation bin and scale , respectively. Finally, the overall loss function is defined as follows,
Massachusetts road dataset: The Massachusetts Roads dataset  consists of train, validation and test sets with 1108, 14 and 49 images, respectively, each with a size of pixels. Following , we fill the training images into size of and then we crop each image into patches with overlapping window of pixels to make the training set. We observed that some parts of the images in the Massachusetts Road dataset are partially occluded and these images severely affect the performance of models. Hence, we removed these occluded images from the training set. Similarly, we crop each image in validation and test sets into patches without any overlapping window. After these series of operations, the processed Massachusetts Road dataset contains , , and images with size of , corresponding to the train, validation and test set, respectively.
DeepGlobe dataset: For the DeepGlobe dataset , we follow the same experimental and data preparation protocols mentioned in . The DeepGlobe dataset consists of images with resolution of . Following , we create splits of images for training and images for testing. Then, we create the patches with resolution with an overlapping window of 256 pixels and this results in total of images for training. Similarly, for the testing dataset also we create patches with resolution of without any overlapping pixels and this results in total of images.
|Method||Massachusetts Road Dataset ||DeepGlobe Dataset |
|Batra et al. ||83.34||84.61||83.97||72.37||64.44||71.34||83.79||84.14||83.97||72.37||67.21||73.12|
|SPIN Road Mapper (ours)||83.90||85.06||84.47||73.12||65.24||72.49||84.14||84.50||84.32||72.89||67.02||74.14|
We use random crops of resolution to train the network for the Massachusetts and DeepGlobe datasets. We use extensive data augmentation techniques such as image rotation, flipping, and mirroring. We use SGD optimizer with a batch size of , a momentum of and a weight decaying of . We use a step-learning rate scheduler with an initial learning rate of where steps are scheduled at , , and . We reduce the learning rate by a factor of at each step. We train the network for a total of epochs. We implemented our model in PyTorch and used an NVIDIA Quadro RTX 8000 GPU for all of our experiments.
For the DeepGlobe dataset accurate road segmentation masks are available and hence, we evaluate the quality of our road predictions using accurate road Intersection over Union () and score. However, the groundtruth segmentation masks of Massachusetts road dataset have constant width and this will adversely affect the pure pixel based metrics. So, as proposed in  we also use relaxed IoU () with buffer size of in our evaluations. Furthermore, we use Average Path Length Similarity (APLS) metric to measure the difference between ground truth and proposal graphs .
In this section, we compare the road segmentation performance of our SPIN Road Mapper with existing methods, quantitatively and qualitatively. In particular, we compare the performance of our method with that of Seg-Net , U-Net , LinkNet34 , HourGlass , Stack-HourGlass , and Batra et al. .
Quantitative results: The quantitative results are summarized in Table I. As can be seen from Table I, the proposed SPIN Road Mapper achieves the state-of-the-art (SOTA) results in terms of all the performance measures for the Massachusetts dataset. When considering the DeepGlobe dataset, our method achieves the SOTA results in terms of F1, , and APLS. Furthermore, the improvement in terms of APLS metric is significant (+1.15% and +1.02% for Massachusetts and DeepGlobe datasets, respectively) and which confirms that the proposed SPIN module improves the connectivity of road segments by specially bringing gaps for occluded areas (see Fig. 5). These qualitative results confirms that the proposed SPIN module effectively captures the long-range dependencies of feature maps, and thereby helps final classifier to identify road pixels substantially well compared to the available ConvNet architectures for road segmentation.
Qualitative results: For the qualitative analysis, we visualize the predicted road maps from SegNet , LinkNet , Stack-HourGlass , Batra et. al , and our SPIN Road Mapper on the Massachusetts Road dataset in Figure 4. The red boxes in Figure 4 highlight the regions where our method performs better than the baseline methods. For example, consider the last row of Figure 4. Roads in the region highlighted by the red box are mostly covered by trees and buildings (as can be seen from the input aerial image), making it difficult for the baseline segmentation networks to correctly identify the presence of roads. In contrast, our method is able to predict most of the road segments due to its ability to capture long-range dependencies between road pixels through spatial graph reasoning, as well as its ability to delineate roads from surrounding structures through interaction space graph reasoning.
Ablation study: We conduct an ablation study to demonstrate the effect of spatial, interaction and SPIN graph reasoning on road segmentation. It can be seen from Table II that integrating spatial and interaction space graph reasoning to the ConvNet-based network results in increase road segmentation accuracy. Combining the spatial and interaction space graph reasoning together in SPIN pyramid results in further improvement over the individual components. In addition to the quantitative comparison, we also present a qualitative comparison in Figure 5 which clearly demonstrates how each graph reasoning technique improves the quality of road predictions. These experiments show that out proposed SPIN pyramid helps the network learn features with more global contextual information resulting in an improved performance.
|ConvNet + Spatial GR||84.13||66.82||73.59|
|ConvNet + Interaction GR||84.12||66.76||73.52|
|ConvNet + SPIN GR||84.32||67.02||74.14|
Convergence: Figure 6 shows the training convergence plot of the proposed network with and without the SPIN module. We can observe that adding the SPIN module helps achieve faster convergence. This leads to reduction in training time which is crucial for training ConvNets on large-scale high-resolution remote sensing datasets.
We presented a Spatial and Interaction Space Graph Reasoning (SPIN) module that can be plugged into ConvNets to learn distant relationships between road segments in the feature space. Learning global dependencies are essential while extracting complex road topology from aerial images where most of the road segments are partially or completely occluded by trees, buildings, or clouds. The graph reasoning over the spatial space helps the network to extract more dependencies between different spatial regions and other contextual information whereas graph reasoning over a projected interaction space helps to delineate roads from surrounding objects. We conduct extensive experiments and compare the predicted road maps qualitatively and quantitatively with existing methods. We observe that our SPIN module helps convolutional networks to extract long-range dependencies and thereby improve the segmentation quality. SPIN is computationally light and also helps in faster convergence which are crucial while training ConvNets on large-scale high-resolutions datasets.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4720–4728. Cited by: §I, §II, §II.
Proceedings of the 29th International conference on machine learning (ICML-12), pp. 567–574. Cited by: §IV-C.
Stacked hourglass networks for human pose estimation. In European conference on computer vision, pp. 483–499. Cited by: §II, §III-B, TABLE I, §V.
ICurb: imitation learning-based detection of road curbs using aerial images for autonomous driving. IEEE Robotics and Automation Letters 6 (2), pp. 1097–1104. Cited by: §I.