Synthetic Aperture Radar (SAR) has been widely used in Earth observation applications due to its capability to work under all weather and daylight conditions. The semantic segmentation of Polarimetric SAR (PolSAR) images, namely the pixel-wise classification of PolSAR images according to ground surface types, is beneficial to a large number of remote sensing applications (e.g., urban area management, disaster monitoring and land-cover mapping).
In recent years, with the rise of convolutional neural networks (CNNs), many methods have been developed for the semantic segmentation of natural images [zhao2017pspnet][chen2018deeplabv3+] and remote sensing images [ding2020lanet][mou2019relation]. However, only a limited number of studies have been conducted on the semantic segmentation of PolSAR images based on deep CNNs [geng2020multi]. There are two major barriers: i) PolSAR images contain intense speckle noise due to the coherent imaging mechanism of PolSAR systems, which poses a severe challenge for automatic segmentation algorithms. ii) A large amount of training data is required to train an effective deep CNN, yet the manual annotation of SAR data is not only labor-intensive but also difficult, since the ground objects in SAR images can hardly be recognized by human observation without auxiliary data and expert knowledge [duan2018multi].
Most existing works on the semantic segmentation of SAR images are designed under the assumption that there are not enough labeled samples to train a CNN. In [li2018novel], a relatively small portion of the pixels of the image to be classified is used for the training phase. The Fully Convolutional Network (FCN) is used in a pipeline that contains sliding-crop and sparse coding operations. In [wang2018hierarchical], the FCN is combined with sparse and low-rank subspace representations to alleviate the problem of insufficient training data. In [duan2018multi], a multi-scale CNN is proposed for the semantic segmentation of SAR images. To compensate for the lack of ground truth segmentation maps, it uses image scene labels for the training. In [cao2019pixel], the FCN is applied in the complex domain of SAR data to include the phase information. In [wu2019polsar], extra datasets are used to pre-train CNN models for the semantic segmentation of SAR images. Since all these studies have been conducted on small datasets (most datasets contain only a single image), their main objective is to reduce the dependence on training data while keeping high generalization capabilities.
Recently, with the emergence of several benchmark datasets, new developments in the CNN-based semantic segmentation of SAR images have been proposed. In [mohammadimanesh2019new], an ‘encoder-decoder’ CNN with inception modules and skip connections is introduced for the semantic segmentation of wetland PolSAR images. In [yue2020MANet], a multi-scale attention-based FCN (MANet) is presented that combines multi-scale feature extraction with the attention mechanism. In [wang2020hrsarnet], a small yet efficient network (HR-SARNet) is proposed for the semantic segmentation of high-resolution SAR images. However, most of these studies are based on shallow CNNs to avoid over-fitting problems. Since shallow CNNs are not powerful enough to extract high-level semantic information, their accuracy is limited.
Differently from previous studies, this work aims to improve the semantic segmentation of high-resolution PolSAR images by proposing a novel network architecture, the Multi-path Residual Network (MP-ResNet). This network enables multi-scale modeling of high-level semantic features through its parallel branches, which strengthens the learning of local discriminative features and reduces the effects of speckle noise. The network is developed on a large benchmark dataset of PolSAR images made available by the Gaofen committee. Ablation studies show that the proposed approach achieves average increases of 0.36% and 0.64% in OA and fwIoU, respectively, compared to the baseline FCN. It also surpasses several literature works in the comparative experiments.
The key to improving the semantic segmentation of PolSAR images is to learn discriminative features from a larger image context, so that the effects of speckle noise can be mitigated. Based on this idea, we propose the Multi-path Residual Network (MP-ResNet) shown in Fig. 1. This network builds on a baseline fully convolutional network (FCN). It is inspired by the multi-path convolution design of the high-resolution network (HRNet) [wang2020hrnet] and the multi-scale feature fusion design of LinkNet [chaurasia2017linknet]. In this section we introduce in detail the design of the proposed MP-ResNet.
II-A Quick Embedding of Shallow Features
PolSAR data generally contain 4 channels, which correspond to the 4 combinations of linear polarizations (i.e., the HH and VV co-polarized channels and the HV and VH cross-polarized channels). The corresponding 4 images are stacked and given as input to a CNN. Since CNNs are capable of extracting semantic features from raw input data, no extra filtering operations are applied to the input images. Instead, we merely apply maximum-suppression and normalization operations to stabilize and compress the value range of the input data.
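The preprocessing described above can be illustrated with a minimal sketch. It assumes that "maximum-suppression" means clipping values above a fixed ceiling and that normalization means dividing by that ceiling; the function name `suppress_and_normalize` and the ceiling of 512.0 are hypothetical and are not taken from the paper.

```python
def suppress_and_normalize(pixels, max_value):
    """Clip values above max_value, then rescale the result to [0, 1].

    Assumption: this is one plausible reading of 'maximum-suppression and
    normalization'; the paper does not state the exact procedure.
    """
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    clipped = [min(v, max_value) for v in pixels]
    return [v / max_value for v in clipped]


# A few amplitude values from one polarization channel (made-up numbers).
hh_channel = [12.0, 480.0, 3000.0, 95.0]
print(suppress_and_normalize(hh_channel, max_value=512.0))
# -> [0.0234375, 0.9375, 1.0, 0.185546875]
```

The outlier 3000.0 is capped at the ceiling, so a few very bright scatterers no longer dominate the value range seen by the network.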
It is well known that the signal at pixel level in PolSAR images is generally very noisy. The most discriminative features in these images are the signal values and texture patterns in local areas. This makes it necessary to down-sample the features in the early layers of a CNN and embed the features from a wider range. Considering this, we adopt the ResNet as the backbone feature extraction network. The first layers of ResNet are a strided convolution followed by a pooling layer, which quickly reduce the scaling rate of the features to 1/4. Additionally, while 3×3 convolutional kernels are widely used in CNNs to avoid over-fitting problems, the kernel size of the first convolutional layer in the ResNet is 7×7. This helps the network to aggregate features from a larger pixel neighbourhood and alleviates the effect of speckle noise. After the first two residual modules of ResNet, the scaling rate of the features is further reduced to 1/8. Considering that most PolSAR images are not high-resolution and thus do not contain rich spatial details, this feature size is large enough for modeling the regional land-cover information. Therefore, 1/8 is set as the fundamental scaling rate for the aggregation of context information in the proposed network.
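The scaling rates quoted above can be checked with the usual convolution output-size formula. The sketch below traces a 512-pixel side length through the early ResNet stages; the strides and paddings are the commonly used ResNet settings and are an assumption here, since the text only states the 7×7 kernel and the resulting 1/4 and 1/8 rates.

```python
def out_size(n, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def trace_scales(input_size=512):
    """Return (stage name, feature size, down-sampling factor) per stage."""
    # (name, kernel, stride, padding); assumed standard ResNet settings.
    stages = [
        ("conv 7x7, stride 2", 7, 2, 3),
        ("max-pool 3x3, stride 2", 3, 2, 1),
        ("residual module 1, stride 1", 3, 1, 1),
        ("residual module 2, stride 2", 3, 2, 1),
    ]
    size = input_size
    trace = []
    for name, kernel, stride, padding in stages:
        size = out_size(size, kernel, stride, padding)
        trace.append((name, size, input_size // size))
    return trace


for name, size, factor in trace_scales(512):
    print(f"{name:30s} size={size:4d}  scale=1/{factor}")
```

For a 512×512 input the feature maps shrink to 256, 128, 128 and 64 pixels per side, i.e. 1/4 after the pooling layer and 1/8 after the second residual module.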
II-B Multi-path Semantic Information Embedding
The size of valid receptive fields (VRFs) is known to be crucial to the embedding of context information in CNNs [zhang2018context]. In the semantic segmentation of remote sensing data, the size of the VRFs determines the spatial range from which the CNN can exploit discriminative features, which is related to the granularity of the semantic segmentation results [ding2020twostage]. Although the serial connection of pooling and strided convolutions can greatly enlarge the VRFs, it also causes a loss of spatial information. Therefore, how to simultaneously enlarge VRFs and preserve spatial information is one of the most crucial bottleneck problems in semantic segmentation tasks. PSPNet [zhao2017pspnet] and DeepLab [chen2018deeplabv3+] manage to enlarge VRFs without severe loss of spatial information through the use of additional context aggregation modules in the late layers of the CNNs. However, the context information is aggregated through pooling and dilated convolutions in these designs, which are less effective than stacked convolutional layers. The dilated convolutions may also cause gridding effects and increase the computational cost.
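To make the trade-off above concrete, the following sketch computes the theoretical receptive field of a layer stack. Note that this is the theoretical field, which only bounds the valid (effective) receptive field discussed in the text; the three example stacks are hypothetical and do not correspond to any network in this paper.

```python
def receptive_field(layers):
    """Theoretical receptive field of a stack of (kernel, stride, dilation)."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Four 3x3 layers in three configurations (illustrative only).
plain = [(3, 1, 1)] * 4                                     # no down-sampling
strided = [(3, 2, 1), (3, 2, 1), (3, 1, 1), (3, 1, 1)]      # two stride-2 layers
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4), (3, 1, 8)]      # growing dilation

print(receptive_field(plain))    # 9
print(receptive_field(strided))  # 23
print(receptive_field(dilated))  # 31
```

Striding and dilation both widen the field with the same number of layers; striding pays for it with a smaller feature map, while dilation keeps the resolution but samples the input sparsely, which is the source of the gridding effects mentioned above.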
Alternatively, HRNet presents a multi-path architecture that organizes multi-scale convolutional layers in a parallel manner. The highest scaling rate of HRNet is 1/32, which ensures a large VRF of the network. However, the serial convolutions in its parallel branches greatly increase the computational cost of this network. They may also cause over-fitting problems on small datasets. In addition, in HRNet the multi-scale feature branches are simply concatenated to generate the semantic segmentation results, thus the features are not fully fused and utilized.
Inspired by the parallel feature embedding design of the HRNet, the proposed MP-ResNet uses ResNet as the feature extraction network but has 2 additional encoding branches after the second convolutional block. Therefore, the features are encoded both forward (the size remains the same) and downward (the size is reduced). The same residual blocks are duplicated in the parallel branches so that each encoding branch has the same number of convolutional layers. In this way, each encoding branch contains rich semantic information.
Differently from the HRNet, the parallel branches in MP-ResNet focus on the embedding of high-level features (the lowest scaling rate is 1/8). Although the parallel embedding of larger feature maps is potentially feasible in other applications, for the semantic segmentation of PolSAR images our objective is to exploit discriminative features from a wider range. In this way, the segmentation results become less sensitive to pixel noise. The computational cost of MP-ResNet is only slightly higher than that of ResNet and far lower than that of the HRNet (see discussion in Section III). Another significant difference of the MP-ResNet is that a decoder network is employed to fuse the features learned from its parallel branches.
II-C Fusion of Multi-scale Features
Decoder networks are commonly used in semantic segmentation to recover the spatial details of encoded features. A common design is to concatenate or add the encoded features with the features from early layers of the encoder network (e.g., UNet [ronneberger2015unet] and SegNet [badrinarayanan2017segnet]). Although this design can aggregate spatial information from low-level features, it also introduces redundant information (minor details and noise). For the segmentation of high-resolution PolSAR images this problem can be critical. In the proposed MP-ResNet, by contrast, the fused multi-scale features are the ones encoded by the parallel branches of the encoder. These features contain rich semantic information, thus their fusion does not lead to noise problems. A feature deconvolution module [chaurasia2017linknet] is introduced to enlarge the spatial size of the features from higher branches before the fusion. Fig. 2 shows this spatial deconvolution block. It adopts a channel-wise ‘Bottleneck’ design to reduce the computation and refine the crucial semantic information. In this way, the multi-branch features are fused in a coarse-to-fine manner in the decoder.
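A rough parameter count shows why a channel-wise bottleneck reduces the cost of the deconvolution block. The sketch assumes a LinkNet-style layout (1×1 reduction to a quarter of the input channels, a 3×3 transposed convolution, then a 1×1 expansion); the reduction factor, kernel sizes and channel widths are assumptions for illustration, not values reported in this paper.

```python
def conv_params(c_in, c_out, kernel):
    """Weight count of a (transposed) convolution, biases ignored."""
    return c_in * c_out * kernel * kernel


def bottleneck_deconv_params(c_in, c_out, reduction=4):
    """1x1 reduce -> 3x3 transposed conv -> 1x1 expand (assumed layout)."""
    mid = c_in // reduction
    return (conv_params(c_in, mid, 1)      # channel reduction
            + conv_params(mid, mid, 3)     # up-sampling on the narrow features
            + conv_params(mid, c_out, 1))  # channel expansion


def plain_deconv_params(c_in, c_out):
    """A single 3x3 transposed convolution on the full channel width."""
    return conv_params(c_in, c_out, 3)


c_in, c_out = 512, 256  # hypothetical channel widths
print(bottleneck_deconv_params(c_in, c_out))  # 245760
print(plain_deconv_params(c_in, c_out))       # 1179648
```

With these widths the bottleneck variant needs roughly a fifth of the weights of a direct 3×3 transposed convolution, because the expensive 3×3 operation runs on the reduced channel dimension.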
III Experimental Results
III-A Dataset Descriptions and Evaluation Metrics
The experiments in this study are conducted on the Gaofen dataset provided by the ‘2020 Gaofen contest on automated high-resolution earth observation image interpretation’. The PolSAR images were collected by the Gaofen-3 satellite. Their ground sampling distance is between 1 m and 3 m. The ground truth maps are annotated according to 5 land-cover types: background, built-up area, vegetation, water and bare soil. The accessible training data are 500 pairs of PolSAR images and label maps, each with 512×512 pixels. The testing data are not visible to users, but a scoring system is provided to evaluate the uploaded algorithms.
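We adopt 3 metrics for the evaluation of the semantic segmentation results: overall accuracy (OA), F1 score and frequency weighted intersection over union (fwIoU). OA is the ratio of the number of correctly classified pixels to the total number of pixels. F1 is calculated as:
$$F_1 = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
fwIoU is the evaluation metric suggested by the contest organizer. It is calculated as:
$$\mathrm{fwIoU} = \frac{1}{\sum_{i=1}^{N}\sum_{j=1}^{N} p_{ij}} \sum_{i=1}^{N} \frac{\left(\sum_{j=1}^{N} p_{ij}\right) p_{ii}}{\sum_{j=1}^{N} p_{ij} + \sum_{j=1}^{N} p_{ji} - p_{ii}}$$
where $N$ is the total number of classes and $p_{ij}$ denotes the number of $i$-th class pixels that are classified into the $j$-th class.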
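The three metrics can be computed from a confusion matrix whose entry in row i and column j counts the pixels of class i that are classified into class j. The sketch below uses the standard definitions of OA, F1 and frequency weighted IoU; the function names and the 2-class example matrix are illustrative assumptions, and the contest's scoring code may differ in details such as the handling of empty classes.

```python
def overall_accuracy(cm):
    """Correctly classified pixels divided by all pixels."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def f1_per_class(cm):
    """F1 = 2 * precision * recall / (precision + recall) for each class."""
    n = len(cm)
    scores = []
    for i in range(n):
        actual = sum(cm[i])                          # pixels labeled as class i
        predicted = sum(cm[j][i] for j in range(n))  # pixels predicted as class i
        denom = actual + predicted
        # 2*TP / (actual + predicted) is algebraically equal to the F1 formula.
        scores.append(2 * cm[i][i] / denom if denom else 0.0)
    return scores


def fw_iou(cm):
    """Per-class IoU weighted by the pixel frequency of each class."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    score = 0.0
    for i in range(n):
        actual = sum(cm[i])
        predicted = sum(cm[j][i] for j in range(n))
        union = actual + predicted - cm[i][i]
        if union:
            score += actual * cm[i][i] / union
    return score / total


cm = [[8, 2],
      [1, 9]]  # made-up confusion matrix for two classes
print(overall_accuracy(cm))  # 0.85
print(f1_per_class(cm))
print(fw_iou(cm))
```

Because each class IoU is weighted by its pixel count, fwIoU is dominated by frequent classes, whereas the mean F1 treats all classes equally.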
III-B Multi-fold Ablation Study
To quantitatively evaluate the improvement of the proposed MP-ResNet over the baseline method (FCN), we conduct a multi-fold ablation study on the Gaofen dataset. The training and validation sets are randomly divided from all available training data (500 image pairs) with a ratio of 9:1. In this way, a total of 10 training and validation sets are obtained. To reduce the effects of random factors, the ablation study is conducted on all 10 training-validation sets. The results are reported in Table I. Compared to the baseline method (FCN), the proposed MP-ResNet shows average improvements of 0.36%, 0.54% and 0.64% in OA, mean F1 and fwIoU, respectively. The improvements have also been verified on the contest test set, which is not directly available. According to the fwIoU scores obtained from the contest, the proposed MP-ResNet achieves an increase of 1.23% in fwIoU compared to the FCN.
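One way to obtain 10 training-validation sets with a 9:1 ratio is a 10-fold split, sketched below. The text does not state whether the validation subsets are disjoint, so the k-fold scheme, the function name `make_folds` and the seed are assumptions for illustration.

```python
import random


def make_folds(num_samples=500, num_folds=10, seed=0):
    """Return a list of (train_indices, val_indices) pairs.

    Assumes disjoint validation folds (k-fold cross-validation). If
    num_samples is not divisible by num_folds, the leftover samples
    always stay in the training part.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    fold_size = num_samples // num_folds
    splits = []
    for k in range(num_folds):
        val = indices[k * fold_size:(k + 1) * fold_size]
        val_set = set(val)
        train = [i for i in indices if i not in val_set]
        splits.append((train, val))
    return splits


splits = make_folds()
train, val = splits[0]
print(len(splits), len(train), len(val))  # 10 450 50
```

Each image pair appears in exactly one validation set under this scheme, so averaging the metrics over the 10 runs uses every sample once for validation.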
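|Datasets||FCN (Baseline) OA||FCN (Baseline) mean F1||FCN (Baseline) fwIoU||MP-ResNet (Proposed) OA||MP-ResNet (Proposed) mean F1||MP-ResNet (Proposed) fwIoU|
|Test set||-||-||69.42||-||-||70.65 (+1.23)|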
III-C Comparative Experiments
To further assess the improvements brought by the proposed MP-ResNet, we compare it with several literature works. Apart from the baseline method FCN, several well-established methods in the field of semantic segmentation have been considered, including the SegNet [badrinarayanan2017segnet], the UNet [ronneberger2015unet], the PSPNet [zhao2017pspnet] and the DeepLabv3+ [chen2018deeplabv3+]. Since the proposed MP-ResNet is inspired by the multi-path architecture of HRNet [wang2020hrnet] and the feature fusion design of LinkNet [chaurasia2017linknet], we also include these networks in the comparison. Moreover, several literature methods presented for the semantic segmentation of SAR images have also been tested, including the multi-scale FCN (MS-FCN) in [wu2019polsar], the Inception FCN (Inc-FCN) in [mohammadimanesh2019new] and the HR-SARNet [wang2020hrsarnet]. The training and validation sets used in the comparison are those of the first row of Table I. For fairness, the same parameter settings have been applied during the training process (e.g., training epochs, batch size, learning rate).
Table II presents the quantitative results obtained on the Gaofen dataset. Due to the effects of speckle noise, the performance of shallow networks (SegNet, HR-SARNet, Inc-FCN and UNet), which rely heavily on low-level features, is unsatisfactory. Although there is also a multi-scale design in the MS-FCN, there is no enhancement in its branches, thus its accuracy is lower than that of the FCN. The simple FCN without any sophisticated design ranks in 3rd place. Although HRNet shows better performance in the semantic segmentation of VHR optical remote sensing images [zhang2020multi], its size is too large to be fully trained on the Gaofen dataset, thus its accuracy is far lower than that of the FCN. LinkNet also has a deep encoding network (ResNet34), but it has skip connections with low-level features, which introduce noise and degrade its accuracy. The PSPNet has an additional multi-scale average pooling head compared with the FCN, which does not prove to be effective on the PolSAR dataset. The multi-scale atrous convolution head in DeepLabV3+ increases the mean F1, OA and fwIoU by 0.57%, 0.30% and 0.36%, respectively, compared to the FCN. The proposed MP-ResNet achieves the best accuracy in nearly every metric except the F1 of the ‘others’ class. Its improvements over the DeepLabV3+ are 0.57%, 0.57% and 1.13% in mean F1, OA and fwIoU, respectively. This demonstrates the effectiveness of the proposed network, with which we won 2nd place in the ‘Gaofen Challenge’ contest.
Table II: Quantitative results on the Gaofen dataset. Class columns: Water, Built-up area, Industrial area, Grassland, Barren, Others.
Fig. 3 shows a comparison of the segmentation results on several sample areas. Due to its multi-path modeling and multi-scale feature fusion design, the proposed MP-ResNet is capable of modeling context information from a wider image range. Therefore, some areas that are challenging for the other networks are correctly segmented and the detected object boundaries are more continuous.
Fig. 3: Comparison of the segmentation results on sample areas. Panels: PolSAR images (pseudo color), Ground truth, FCN, DeepLabv3+, MP-ResNet (Proposed).
Table III: Sizes and computational costs of the compared methods. Columns: Methods, Params (Mb), FLOPS (Gbps).
Table III presents the sizes and computational costs of the compared methods. The floating point operations per second (FLOPS) are calculated based on the input size [4, 512, 512] of the images in the Gaofen dataset. The literature methods for the semantic segmentation of SAR images (HR-SARNet and Inc-FCN) are generally shallow, thus their sizes are relatively small. UNet, Inc-FCN and SegNet need the most FLOPS since they apply many convolution operations on the early-layer features. Although the parameter size of the MP-ResNet is much larger than that of the FCN, its FLOPS do not increase significantly.
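The relation between parameter size and computation noted above follows from how the cost of a convolution scales: the weight count depends only on the channels and kernel size, while the operation count also scales with the spatial size of the output. The sketch below compares a hypothetical early-layer convolution with a hypothetical late-layer one; the channel widths and feature sizes are illustrative assumptions, not values from Table III.

```python
def conv_cost(c_in, c_out, kernel, out_h, out_w):
    """Return (weights, multiply-accumulate operations), biases ignored."""
    params = c_in * c_out * kernel * kernel
    macs = params * out_h * out_w  # every weight is applied at each output pixel
    return params, macs


# Early layer: few channels, full 512x512 resolution.
early_params, early_macs = conv_cost(64, 64, 3, 512, 512)
# Late layer: many channels, 1/8 resolution (64x64).
late_params, late_macs = conv_cost(512, 512, 3, 64, 64)

print(early_params, early_macs)  # 36864 9663676416
print(late_params, late_macs)    # 2359296 9663676416
```

In this example the late layer holds 64 times more weights than the early one at the same operation count, which illustrates how branches added at the 1/8 scale can enlarge a model considerably while adding comparatively little computation.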
The semantic segmentation of PolSAR images is challenging due to the intense speckle noise and the lack of large training datasets. Taking advantage of the open dataset from the Gaofen contest, we propose a Multi-Path Residual Network (MP-ResNet) for the semantic segmentation of high-resolution PolSAR images. Compared to the baseline FCN, the MP-ResNet has three parallel semantic embedding branches to strengthen the aggregation of context information. It also adopts a multi-scale feature fusion design in its decoder to take advantage of each encoding branch. As a result, the VRF of the MP-ResNet is significantly enlarged, thus allowing the aggregation of discriminative features from a large range and alleviating the impact of noise. The multi-fold ablation study conducted on the Gaofen dataset confirms the effectiveness of our designs. The comparative experiments with several state-of-the-art methods show that the proposed method improves over them in mean F1, OA and fwIoU.
Since the objective of this work is to propose a general architecture for the semantic segmentation of PolSAR images, we did not add sophisticated designs. However, in future work we plan to incorporate additional context aggregation modules (e.g., dilated convolutions, attention mechanisms) to further improve the performance.